Heh, yeah, I've thought the same kind of thing in the past. The problem is that the argument doesn't really work for system admins.
As far as I'm concerned, the 7000 series is a new hardware platform, with relatively untested drivers, running a software solution that I know is prone to locking up when hardware faults are handled badly by drivers. Fair enough, that particular solution is out of our price range anyway, but I would still be very dubious about purchasing it. At the very least I'd be waiting a year for other people to work the kinks out of the drivers.

Which is a shame, because ZFS has so many other great features that it's easily our first choice for a storage platform. The one and only concern we have is its reliability. We have snv_106 running as a test platform now. If I felt I could trust ZFS 100%, I'd roll it out tomorrow.

On Thu, Feb 12, 2009 at 4:25 PM, Tim <t...@tcsac.net> wrote:
>
> On Thu, Feb 12, 2009 at 9:25 AM, Ross <myxi...@googlemail.com> wrote:
>>
>> This sounds like exactly the kind of problem I've been shouting about
>> for 6 months or more. I posted a huge thread on availability on these
>> forums because I had concerns over exactly this kind of hanging.
>>
>> ZFS doesn't trust hardware or drivers when it comes to your data -
>> everything is checksummed. However, when it comes to seeing whether
>> devices are responding, and checking for faults, it blindly trusts
>> whatever the hardware or driver tells it. Unfortunately, that means
>> ZFS is vulnerable to any unexpected bug or error in the storage
>> chain. I've encountered at least two hang conditions myself (and I'm
>> not exactly a heavy user), and I've seen several others on the
>> forums, including a few on x4500s.
>>
>> Now, I do accept that errors like this will be few and far between,
>> but they still mean you run the risk that a badly handled error
>> condition can hang your entire server instead of just one drive.
>> Solaris can handle CPUs or memory going faulty, for crying out loud;
>> its RAID storage system had better be able to handle a disk failing.
>>
>> Sun seem to be taking the approach that these errors should be dealt
>> with in the driver layer. And while that's technically correct, a
>> reliable storage system had damn well better be able to keep the
>> server limping along while we wait for patches to the storage
>> drivers.
>>
>> ZFS absolutely needs an error handling layer between the volume
>> manager and the devices. It needs to time out devices that are not
>> responding, and it needs to drop bad devices if they could cause
>> problems elsewhere.
>>
>> And yes, I'm repeating myself, but I can't understand why this is not
>> being acted on. Right now the error checking appears to be such that
>> if an unexpected or badly handled error condition occurs in the
>> driver stack, the pool or the server hangs, whereas the expected
>> behavior would be for just one drive to fail. The absolute worst-case
>> scenario should be that an entire controller has to be taken offline
>> (and I would hope that the controllers in an x4500 would be running
>> separate instances of the driver software).
>>
>> None of those conditions should be fatal: good storage designs cope
>> with them all, and good error handling at the ZFS layer is absolutely
>> vital when you have projects like Comstar introducing more and more
>> types of storage device for ZFS to work with.
>>
>> Each extra type of storage introduces yet more software into the
>> equation and increases the risk of finding faults like this. While
>> such faults will be rare, they should be expected, and ZFS should be
>> designed to handle them.
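The error handling layer Ross describes above is, for me, the crux of
it. Just to make the idea concrete, here's a rough userland sketch
(every name in it is made up for illustration; this is nothing like the
real ZFS internals): hand each device I/O to a worker thread and give
up on the device when a deadline passes, so that one wedged driver
costs you a single vdev rather than the whole pool.

/*
 * Rough sketch of a timeout layer between the volume manager and the
 * devices.  All names are invented; this is not ZFS code.
 * Build with: cc -o io_timeout io_timeout.c -lpthread
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

struct io_req {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    int             done;    /* has the driver answered? */
    int             result;  /* errno-style result of the I/O */
};

/* Stand-in for a driver call that can block forever on bad hardware. */
static void *driver_io(void *arg)
{
    struct io_req *req = arg;

    sleep(30);  /* simulate a wedged disk */

    pthread_mutex_lock(&req->lock);
    req->done = 1;
    req->result = 0;
    pthread_cond_signal(&req->done_cv);
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

/*
 * Issue one I/O with a deadline.  Returns 0 on success, or ETIMEDOUT
 * if the device didn't answer in time, at which point the caller can
 * fault that one vdev and let the rest of the pool carry on.
 */
static int io_with_deadline(time_t seconds)
{
    struct io_req *req;
    struct timespec deadline;
    pthread_t worker;
    int rc = 0;

    if ((req = malloc(sizeof (*req))) == NULL)
        return ENOMEM;
    pthread_mutex_init(&req->lock, NULL);
    pthread_cond_init(&req->done_cv, NULL);
    req->done = 0;
    req->result = 0;

    deadline.tv_sec = time(NULL) + seconds;
    deadline.tv_nsec = 0;

    pthread_create(&worker, NULL, driver_io, req);

    pthread_mutex_lock(&req->lock);
    while (!req->done && rc == 0)
        rc = pthread_cond_timedwait(&req->done_cv, &req->lock, &deadline);
    pthread_mutex_unlock(&req->lock);

    if (rc == ETIMEDOUT) {
        /*
         * Abandon the request.  The memory is leaked on purpose: the
         * stuck worker still owns it and may touch it whenever the
         * driver finally lets go.
         */
        pthread_detach(worker);
        return ETIMEDOUT;
    }

    pthread_join(worker, NULL);
    rc = req->result;
    free(req);
    return rc;
}

int main(void)
{
    if (io_with_deadline(5) == ETIMEDOUT)
        printf("device not responding: fault the vdev, keep the pool\n");
    else
        printf("I/O completed\n");
    return 0;
}

A real version would belong down in the vdev layer, with the fault
reported through FMA rather than printf, but the principle is the
point: the health of the pool should never depend on a single driver
call returning.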
>
> I'd imagine for the exact same reason short-stroking/right-sizing
> isn't a concern.
>
> "We don't have this problem in the 7000 series, perhaps you should buy
> one of those".
>
> ;)
>
> --Tim

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss