On Thu, Feb 12, 2009 at 9:25 AM, Ross <myxi...@googlemail.com> wrote:
> This sounds like exactly the kind of problem I've been shouting about for 6
> months or more. I posted a huge thread on availability on these forums
> because I had concerns over exactly this kind of hanging.
>
> ZFS doesn't trust hardware or drivers when it comes to your data -
> everything is checksummed. However, when it comes to seeing whether devices
> are responding, and checking for faults, it blindly trusts whatever the
> hardware or driver tells it. Unfortunately, that means ZFS is vulnerable to
> any unexpected bug or error in the storage chain. I've encountered at least
> two hang conditions myself (and I'm not exactly a heavy user), and I've seen
> several others on the forums, including a few on x4500s.
>
> Now, I do accept that errors like this will be few and far between, but
> they still mean you run the risk that a badly handled error condition can
> hang your entire server, instead of just one drive. Solaris can handle
> things like CPUs or memory going faulty, for crying out loud. Its RAID
> storage system had better be able to handle a disk failing.
>
> Sun seem to be taking the approach that these errors should be dealt with
> in the driver layer. And while that's technically correct, a reliable
> storage system had damn well better be able to keep the server limping along
> while we wait for patches to the storage drivers.
>
> ZFS absolutely needs an error handling layer between the volume manager and
> the devices. It needs to time out items that are not responding, and it
> needs to drop bad devices if they could cause problems elsewhere.
>
> And yes, I'm repeating myself, but I can't understand why this is not being
> acted on. Right now the error checking appears to be such that if an
> unexpected or badly handled error condition occurs in the driver stack, the
> pool or server hangs, whereas the expected behavior would be for just one
> drive to fail. The absolute worst case scenario should be that an entire
> controller has to be taken offline (and I would hope that the controllers in
> an x4500 would be running separate instances of the driver software).
>
> None of those conditions should be fatal; good storage designs cope
> with them all, and good error handling at the ZFS layer is absolutely vital
> when you have projects like Comstar introducing more and more types of
> storage device for ZFS to work with.
>
> Each extra type of storage introduces yet more software into the equation,
> and increases the risk of finding faults like this. While they will be
> rare, they should be expected, and ZFS should be designed to handle them.
>

I'd imagine it's for the exact same reason that short-stroking/right-sizing
isn't a concern.

"We don't have this problem in the 7000 series, perhaps you should buy one
of those". ;)

--Tim
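
As an aside, the "time out unresponsive items and drop bad devices" behaviour
asked for in the quoted post can be sketched in a few lines. This is a toy
illustration in Python, not ZFS or Solaris code: GuardedDevice, mirror_read,
the 0.1 s timeout and the two-strikes threshold are all hypothetical names and
values invented for the example. A real implementation would sit in the kernel
I/O path and would have to cancel the outstanding request, whereas this sketch
simply abandons the worker thread.

```python
import concurrent.futures


class GuardedDevice:
    """Hypothetical wrapper around one device's read function.

    A read that takes longer than timeout_s counts as a miss; after
    max_timeouts consecutive misses the device is marked faulted and is
    no longer sent any I/O.
    """

    def __init__(self, name, read_fn, timeout_s=0.1, max_timeouts=2):
        self.name = name
        self._read_fn = read_fn
        self.timeout_s = timeout_s
        self.max_timeouts = max_timeouts
        self.consecutive_timeouts = 0
        self.faulted = False
        # Reads run in worker threads so the caller never blocks forever.
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def read(self, block):
        if self.faulted:
            raise OSError("device %s is faulted" % self.name)
        future = self._pool.submit(self._read_fn, block)
        try:
            data = future.result(timeout=self.timeout_s)
        except concurrent.futures.TimeoutError:
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= self.max_timeouts:
                self.faulted = True  # drop the device instead of hanging
            raise TimeoutError("read on %s timed out" % self.name)
        self.consecutive_timeouts = 0  # a good read clears the strike count
        return data


def mirror_read(devices, block):
    """Return the block from the first device that answers in time."""
    for dev in devices:
        if dev.faulted:
            continue
        try:
            return dev.read(block)
        except OSError:  # TimeoutError is a subclass of OSError
            continue
    raise OSError("no healthy device could serve block %d" % block)
```

With one device that never answers in time and one healthy mirror, the first
two calls to mirror_read each pay one timeout before falling back to the
healthy side; after that the slow device is faulted and skipped outright, so
the "pool" keeps serving reads instead of hanging with it.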
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss