Heh, yeah, I've thought the same kind of thing in the past.  The
problem is that the argument doesn't really work for system admins.

As far as I'm concerned, the 7000 series is a new hardware platform,
with relatively untested drivers, running a software solution that I
know is prone to locking up when hardware faults are handled badly by
drivers.  Fair enough, that particular solution is out of our price
range anyway, but I would still be very dubious about purchasing it.
At the very least I'd be waiting a year for other people to work the
kinks out of the drivers.

Which is a shame, because ZFS has so many other great features that
it's easily our first choice for a storage platform.  The one and only
concern we have is its reliability.  We have snv_106 running as a test
platform now.  If I felt I could trust ZFS 100% I'd roll it out
tomorrow.



On Thu, Feb 12, 2009 at 4:25 PM, Tim <t...@tcsac.net> wrote:
>
>
> On Thu, Feb 12, 2009 at 9:25 AM, Ross <myxi...@googlemail.com> wrote:
>>
>> This sounds like exactly the kind of problem I've been shouting about for
>> 6 months or more.  I posted a huge thread on availability on these forums
>> because I had concerns over exactly this kind of hanging.
>>
>> ZFS doesn't trust hardware or drivers when it comes to your data -
>> everything is checksummed.  However, when it comes to seeing whether devices
>> are responding, and checking for faults, it blindly trusts whatever the
>> hardware or driver tells it.  Unfortunately, that means ZFS is vulnerable to
>> any unexpected bug or error in the storage chain.  I've encountered at least
>> two hang conditions myself (and I'm not exactly a heavy user), and I've seen
>> several others on the forums, including a few on x4500s.
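
To make that asymmetry concrete, here's a rough sketch (plain C, not
ZFS code -- the checksum, block size and "device" are just stand-ins I
made up).  Bad data coming back from a read is caught by the checksum;
a read that never comes back is not, because nothing bounds the wait:

    /*
     * Illustrative sketch only.  pread() has no deadline, so it blocks
     * for exactly as long as the driver does.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLKSZ 4096

    /* Fletcher-style sum standing in for the real block checksum. */
    static uint64_t
    blk_checksum(const uint8_t *buf, size_t len)
    {
        uint64_t a = 0, b = 0;
        for (size_t i = 0; i < len; i++) {
            a += buf[i];
            b += a;
        }
        return ((b << 32) | (a & 0xffffffffULL));
    }

    static int
    read_and_verify(int fd, off_t off, uint64_t expected, uint8_t *buf)
    {
        ssize_t n = pread(fd, buf, BLKSZ, off);  /* may hang forever */
        if (n != BLKSZ)
            return (-1);  /* driver *reported* a failure: handled */
        if (blk_checksum(buf, BLKSZ) != expected)
            return (-2);  /* silent corruption: the checksum catches it */
        return (0);       /* good data */
    }

    int
    main(void)
    {
        uint8_t buf[BLKSZ];
        int fd = open("/dev/zero", O_RDONLY);  /* stand-in "device" */

        /* An all-zero block sums to 0, so expect 0 here. */
        printf("read_and_verify: %d\n", read_and_verify(fd, 0, 0, buf));
        close(fd);
        return (0);
    }

The point isn't the checksum -- that side of ZFS is excellent -- it's
that the only thing protecting the other question (did the I/O
complete at all?) is the driver itself.
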
>>
>> Now, I do accept that errors like this will be few and far between, but
>> they still mean you run the risk that a badly handled error condition can
>> hang your entire server instead of just one drive.  Solaris can handle
>> things like CPUs or memory going faulty, for crying out loud; its RAID
>> storage system had better be able to handle a disk failing.
>>
>> Sun seems to be taking the approach that these errors should be dealt
>> with in the driver layer.  And while that's technically correct, a
>> reliable storage system had damn well better be able to keep the server
>> limping along while we wait for patches to the storage drivers.
>>
>> ZFS absolutely needs an error-handling layer between the volume manager
>> and the devices.  It needs to time out devices that are not responding,
>> and it needs to drop bad devices if they could cause problems elsewhere.
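
And roughly the sort of layer I mean, again purely as an illustration
(the vdev struct, the 30-second deadline and the retry policy are all
made up, and a real version would live in the kernel rather than on
top of POSIX AIO): the wait on any single device is bounded, a device
that blows its deadline gets dropped, and a read only becomes a
pool-level failure once every replica has failed:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define BLKSZ           4096
    #define IO_DEADLINE_SEC 30   /* arbitrary: patience per device */

    struct vdev {                /* hypothetical per-device state */
        const char *path;
        int         fd;
        int         faulted;
    };

    /* One read with a deadline: 0 on success, -1 on error or timeout. */
    static int
    vdev_read_deadline(struct vdev *vd, void *buf, off_t off)
    {
        struct aiocb cb;
        memset(&cb, 0, sizeof (cb));
        cb.aio_fildes = vd->fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = BLKSZ;
        cb.aio_offset = off;

        if (aio_read(&cb) != 0)
            return (-1);

        const struct aiocb *list[1] = { &cb };
        struct timespec deadline = { .tv_sec = IO_DEADLINE_SEC };

        if (aio_suspend(list, 1, &deadline) != 0) {
            if (errno == EAGAIN) {
                /* Deadline expired: stop trusting this device. */
                aio_cancel(vd->fd, &cb);
                vd->faulted = 1;
            }
            return (-1);
        }
        return (aio_return(&cb) == BLKSZ ? 0 : -1);
    }

    /* Try each healthy replica in turn; fail only if all of them do. */
    static int
    mirror_read(struct vdev *vds, int nvds, void *buf, off_t off)
    {
        for (int i = 0; i < nvds; i++) {
            if (vds[i].faulted)
                continue;
            if (vdev_read_deadline(&vds[i], buf, off) == 0)
                return (0);
            fprintf(stderr, "dropping %s, trying next replica\n",
                vds[i].path);
        }
        return (-1);  /* every replica failed: now it's a pool error */
    }

    int
    main(void)
    {
        char buf[BLKSZ];
        struct vdev vds[2] = {
            { "/dev/zero", open("/dev/zero", O_RDONLY), 0 },
            { "/dev/zero", open("/dev/zero", O_RDONLY), 0 },
        };

        printf("mirror_read: %d\n", mirror_read(vds, 2, buf, 0));
        return (0);
    }

Whether the right number is 30 seconds or 3, and whether this belongs
inside ZFS or in a shim below it, is up for debate -- the point is
just that nothing above this layer ever waits indefinitely on a
single device.
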
>>
>> And yes, I'm repeating myself, but I can't understand why this is not
>> being acted on.  Right now the error checking appears to be such that if
>> an unexpected or badly handled error condition occurs in the driver stack,
>> the pool or server hangs, whereas the expected behavior would be for just
>> one drive to fail.  The absolute worst-case scenario should be that an
>> entire controller has to be taken offline (and I would hope that the
>> controllers in an x4500 would be running separate instances of the driver
>> software).
>>
>> None of those conditions should be fatal; good storage designs cope with
>> them all, and good error handling at the ZFS layer is absolutely vital
>> when you have projects like Comstar introducing more and more types of
>> storage device for ZFS to work with.
>>
>> Each extra type of storage introduces yet more software into the equation,
>> and increases the risk of finding faults like this.  While such faults will
>> be rare, they should be expected, and ZFS should be designed to handle them.
>
>
> I'd imagine for the exact same reason short-stroking/right-sizing isn't a
> concern.
>
> "We don't have this problem in the 7000 series, perhaps you should buy one
> of those".
>
> ;)
>
> --Tim
>
