On Thu, Feb 12, 2009 at 9:25 AM, Ross <myxi...@googlemail.com> wrote:

> This sounds like exactly the kind of problem I've been shouting about for 6
> months or more.  I posted a huge thread on availability on these forums
> because I had concerns over exactly this kind of hanging.
>
> ZFS doesn't trust hardware or drivers when it comes to your data -
> everything is checksummed.  However, when it comes to seeing whether devices
> are responding, and checking for faults, it blindly trusts whatever the
> hardware or driver tells it.  Unfortunately, that means ZFS is vulnerable to
> any unexpected bug or error in the storage chain.  I've encountered at least
> two hang conditions myself (and I'm not exactly a heavy user), and I've seen
> several others on the forums, including a few on x4500s.
>
> Now, I do accept that errors like this will be few and far between, but
> they still mean you run the risk that a badly handled error condition can
> hang your entire server, instead of just one drive.  Solaris can handle
> things like CPUs or memory going faulty, for crying out loud.  Its RAID
> storage system had better be able to handle a disk failing.
>
> Sun seem to be taking the approach that these errors should be dealt with
> in the driver layer.  And while that's technically correct, a reliable
> storage system had damn well better be able to keep the server limping along
> while we wait for patches to the storage drivers.
>
> ZFS absolutely needs an error handling layer between the volume manager and
> the devices.  It needs to time out requests that are not responding, and it
> needs to drop bad devices if they could cause problems elsewhere.
>
> And yes, I'm repeating myself, but I can't understand why this is not being
> acted on.  Right now the error checking appears to be such that if an
> unexpected or badly handled error condition occurs in the driver stack, the
> pool or server hangs, whereas the expected behavior would be for just one
> drive to fail.  The absolute worst case scenario should be that an entire
> controller has to be taken offline (and I would hope that the controllers in
> an x4500 would be running separate instances of the driver software).
>
> None of those conditions should be fatal.  Good storage designs cope
> with them all, and good error handling at the ZFS layer is absolutely vital
> when you have projects like Comstar introducing more and more types of
> storage device for ZFS to work with.
>
> Each extra type of storage introduces yet more software into the equation,
> and increases the risk of finding faults like this.  While they will be
> rare, they should be expected, and ZFS should be designed to handle them.
>


I'd imagine it's not being acted on for the exact same reason
short-stroking/right-sizing isn't a concern.

"We don't have this problem in the 7000 series, perhaps you should buy one
of those".

;)

--Tim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
