Ross, thanks for the feedback.  A couple points here -

A lot of work went into improving the error handling around build 77 of
Nevada.  There are still problems today, but a number of the
complaints we've seen are on s10 software or older Nevada builds that
didn't have these fixes.  Anything from the pre-2008 (or pre-s10u5)
timeframe should be taken with a grain of salt.

There is a fix in the immediate future to prevent I/O timeouts from
hanging other parts of the system - namely administrative commands and
other pool activity.  So I/O to the pool with the failing device will
still hang, but you'll be able to run your favorite ZFS commands, and
other pools will keep running unaffected.

We have some good ideas on how to improve the retry logic.  There is a
flag in Solaris, B_FAILFAST, that tells the drive not to try too hard
to get the data.  However, it can return failure when trying harder
would produce the correct results.  Currently, we try the first I/O with
B_FAILFAST, and if that fails we immediately retry without the flag.  The
idea is to elevate the retry logic to a higher level, so when a read
from a side of a mirror fails with B_FAILFAST, instead of immediately
retrying the same device without the failfast flag, we push the error
higher up the stack, and issue another B_FAILFAST I/O to the other half
of the mirror.  Only if both fail with failfast do we try a more
thorough request (though with ditto blocks we may try another vdev
altogether).  This should improve I/O error latency for a subset of
failure scenarios, and biasing reads away from degraded (but not faulty)
devices should also improve response time.  The tricky part is
incorporating this into the FMA diagnosis engine, as devices may fail
B_FAILFAST requests for a variety of non-fatal reasons.
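
To make the ordering concrete, here's a rough sketch of what that
elevated retry could look like for an n-way mirror.  This is a sketch
only - mirror_child_t, issue_read() and mirror_read() are hypothetical
stand-ins, not the actual vdev/ZIO code:

    #include <stddef.h>

    typedef struct mirror_child {
        int mc_degraded;    /* slow/degraded, but not faulted */
    } mirror_child_t;

    /*
     * Hypothetical stand-in: issue one read to one side of the mirror.
     * 'failfast' corresponds to setting B_FAILFAST in the real I/O path.
     * Returns 0 on success, nonzero on error.
     */
    static int
    issue_read(mirror_child_t *mc, int failfast)
    {
        (void) mc;
        (void) failfast;
        return (-1);        /* stub only */
    }

    /*
     * Proposed ordering: exhaust the cheap (failfast) attempts across
     * every copy - healthy children first, degraded ones last - before
     * asking any drive to try hard.
     */
    static int
    mirror_read(mirror_child_t *mc, size_t nchildren)
    {
        size_t i;

        /* Pass 1: failfast reads from healthy children. */
        for (i = 0; i < nchildren; i++)
            if (!mc[i].mc_degraded && issue_read(&mc[i], 1) == 0)
                return (0);

        /* Pass 2: failfast reads from degraded children. */
        for (i = 0; i < nchildren; i++)
            if (mc[i].mc_degraded && issue_read(&mc[i], 1) == 0)
                return (0);

        /* Pass 3: only now issue thorough, non-failfast reads. */
        for (i = 0; i < nchildren; i++)
            if (issue_read(&mc[i], 0) == 0)
                return (0);

        /* All copies failed; ditto blocks are a separate fallback. */
        return (-1);
    }

The point is just the ordering: every cheap attempt across all copies
happens before any expensive one, so worst-case read latency stays
bounded by the failfast path in the common failure cases.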

Finally, imposing additional timeouts in ZFS is a bad idea.  ZFS is
designed to be a generic storage consumer.  It can be layered on top of
directly attached disks, SSDs, SAN devices, iSCSI targets, files, and
basically anything else.  As such, it doesn't have the necessary context
to know what constitutes a reasonable timeout.  This is explicitly
delegated to the underlying storage subsystem.  If a storage subsystem
is timing out for excessive periods of time when B_FAILFAST is set, then
that's a bug in the storage subsystem, and working around it in ZFS with
yet another set of tunables is not practical.  It will be interesting to
see if this is an issue after the retry logic is modified as described
above.
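
As a rough illustration of why ZFS lacks that context (simplified and
hypothetical names - the real interface has more to it), every leaf
vdev type plugs into the same small ops vector, so the I/O path looks
identical to ZFS whether the backing store is a local disk, an iSCSI
LUN, or a file on another filesystem:

    typedef struct zio zio_t;

    typedef struct vdev_ops {
        void        (*vop_io_start)(zio_t *zio);  /* queue I/O, return */
        void        (*vop_io_done)(zio_t *zio);   /* completion callback */
        const char  *vop_type;
    } vdev_ops_t;

    /* Very different kinds of storage, all behind the same interface. */
    static void generic_io_start(zio_t *zio) { (void) zio; }
    static void generic_io_done(zio_t *zio)  { (void) zio; }

    static const vdev_ops_t hypothetical_disk_ops =
        { generic_io_start, generic_io_done, "disk" };   /* local HBA */
    static const vdev_ops_t hypothetical_iscsi_ops =
        { generic_io_start, generic_io_done, "iscsi" };  /* network RTTs */
    static const vdev_ops_t hypothetical_file_ops =
        { generic_io_start, generic_io_done, "file" };   /* another fs */

A timeout generous enough for the iSCSI case would be absurd for a
local SSD, and vice versa; only the layer underneath knows which one it
is actually talking to, so that's where the timeout policy belongs.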

Hope that helps,

- Eric

On Thu, Aug 28, 2008 at 01:08:26AM -0700, Ross wrote:
> Since somebody else has just posted about their entire system locking up when 
> pulling a drive, I thought I'd raise this for discussion.
> 
> I think Ralf made a very good point in the other thread.  ZFS can guarantee 
> data integrity; what it can't do is guarantee data availability.  The problem 
> is, the way ZFS is marketed people expect it to be able to do just that.
> 
> This turned into a longer thread than expected, so I'll start with what I'm 
> asking for, and then attempt to explain my thinking.  I'm essentially asking 
> for two features to improve the availability of ZFS pools:
> 
> - Isolation of storage drivers so that buggy drivers do not bring down the OS.
> 
> - ZFS timeouts to improve pool availability when no timely response is 
> received from storage drivers.
> 
> And my reasons for asking for these is that there are now many, many posts on 
> here about people experiencing either total system lockup or ZFS lockup after 
> removing a hot swap drive, and indeed while some of them are using consumer 
> hardware, others have reported problems with server-grade kit that definitely 
> should be able to handle these errors:
> 
> Aug 2008:  AMD SB600 - System hang
>  - http://www.opensolaris.org/jive/thread.jspa?threadID=70349
> Aug 2008:  Supermicro SAT2-MV8 - System hang
>  - http://www.opensolaris.org/jive/thread.jspa?messageID=271218
> May 2008: Sun hardware - ZFS hang
>  - http://opensolaris.org/jive/thread.jspa?messageID=240481
> Feb 2008:  iSCSI - ZFS hang
>  - http://www.opensolaris.org/jive/thread.jspa?messageID=206985
> Oct 2007:  Supermicro SAT2-MV8 - system hang
>  - http://www.opensolaris.org/jive/thread.jspa?messageID=166037
> Sept 2007:  Fibre channel
>  - http://opensolaris.org/jive/thread.jspa?messageID=151719
> ... etc
> 
> Now while the root cause of each of these may be slightly different, I feel 
> it would still be good to address this if possible as it's going to affect 
> the perception of ZFS as a reliable system.
> 
> The common factor in all of these is that either the Solaris driver hangs and 
> locks the OS, or ZFS hangs and locks the pool.  Most of these are for 
> hardware that should handle these failures fine (mine occurred for hardware 
> that definitely works fine under Windows), so I'm wondering:  Is there 
> anything that can be done to prevent either type of lockup in these 
> situations?
> 
> Firstly, for the OS, if a storage component (hardware or driver) fails for a 
> non-essential part of the system, the entire OS should not hang.  I 
> appreciate there isn't a lot you can do if the OS is using the same driver as 
> its storage, but certainly in some of the cases above, the OS and the data 
> are using different drivers, and I expect more examples of that could be 
> found with a bit of work.  Is there any way storage drivers could be isolated 
> such that the OS (and hence ZFS) can report a problem with that particular 
> driver without hanging the entire system?
> 
> Please note:  I know work is being done on FMA to handle all kinds of bugs, 
> I'm not talking about that.  It seems to me that FMA involves proper 
> detection and reporting of bugs, which involves knowing in advance what the 
> problems are and how to report them.  What I'm looking for is something much 
> simpler, something that's able to keep the OS running when it encounters 
> unexpected or unhandled behaviour from storage drivers or hardware.
> 
> It seems to me that one of the benefits of ZFS is working against it here.  
> It's such a flexible system that it's being used for many, many types of devices, 
> and that means there are a whole host of drivers being used, and a lot of 
> scope for bugs in those drivers.  I know that ultimately any driver issues 
> will need to be sorted individually, but what I'm wondering is whether 
> there's any possibility of putting some error checking code at a layer above 
> the drivers in such a way it's able to trap major problems without hanging 
> the OS?  i.e., update ZFS/Solaris so they can handle storage-layer bugs 
> gracefully without downing the entire system.
> 
> My second suggestion is to ask if ZFS can be made to handle unexpected events 
> more gracefully.  In the past I've suggested that ZFS have a separate timeout 
> so that a redundant pool can continue working even if one device is not 
> responding, and I really think that would be worthwhile.  My idea is to have 
> a "WAITING" status flag for drives, so that if one isn't responding quickly, 
> ZFS can flag it as "WAITING", and attempt to read or write the same data from 
> elsewhere in the pool.  That would work alongside the existing failure modes, 
> and would allow ZFS to handle hung drivers much more smoothly, preventing 
> redundant pools hanging when a single drive fails.
> 
> The ZFS update I feel is particularly appropriate.  ZFS already uses 
> checksumming since it doesn't trust drivers or hardware to always return the 
> correct data.  But ZFS then trusts those same drivers and hardware absolutely 
> when it comes to the availability of the pool.
> 
> I believe ZFS should apply the same tough standards to pool availability as 
> it does to data integrity.  A bad checksum makes ZFS read the data from 
> elsewhere, why shouldn't a timeout do the same thing?
> 
> Ross

--
Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
