This sounds like exactly the kind of problem I've been shouting about for 6 months or more. I posted a huge thread on availability on these forums because I had concerns over exactly this kind of hanging.
ZFS doesn't trust hardware or drivers when it comes to your data - everything is checksummed. However, when it comes to checking whether devices are responding and looking for faults, it blindly trusts whatever the hardware or driver tells it. Unfortunately, that means ZFS is vulnerable to any unexpected bug or error in the storage chain. I've encountered at least two hang conditions myself (and I'm not exactly a heavy user), and I've seen several others on the forums, including a few on x4500s.

Now, I accept that errors like this will be few and far between, but they still mean you run the risk that one badly handled error condition can hang your entire server, instead of just one drive. Solaris can handle CPUs or memory going faulty, for crying out loud. Its RAID storage system had better be able to handle a disk failing.

Sun seems to be taking the approach that these errors should be dealt with in the driver layer. And while that's technically correct, a reliable storage system had damn well better be able to keep the server limping along while we wait for patches to the storage drivers. ZFS absolutely needs an error handling layer between the volume manager and the devices. It needs to time out requests that are not responding, and it needs to drop bad devices if they could cause problems elsewhere. And yes, I'm repeating myself, but I can't understand why this is not being acted on.

Right now the error checking appears to be such that if an unexpected or badly handled error condition occurs in the driver stack, the pool or the whole server hangs, whereas the expected behavior would be for just one drive to fail. The absolute worst case should be that an entire controller has to be taken offline (and I would hope that the controllers in an x4500 are running separate instances of the driver software).
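The timeout-and-drop layer I'm asking for isn't exotic. Here's a rough Python sketch of the idea - this has nothing to do with ZFS's actual internals, `read_fn` just stands in for any blocking device I/O, and the deadline value is invented for illustration. The point is that a request which doesn't complete in time gets the *device* marked faulted, rather than the caller blocking forever:

```python
import concurrent.futures
import time

DEADLINE_S = 0.1  # hypothetical per-I/O deadline, purely for illustration


def read_with_deadline(read_fn, deadline=DEADLINE_S):
    """Issue a device read, but report the device as faulted instead of
    hanging forever if the driver never returns."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(read_fn)
    try:
        return ("ok", future.result(timeout=deadline))
    except concurrent.futures.TimeoutError:
        # The I/O is stuck somewhere in the driver stack. Give up on this
        # device and let the redundant copies serve the data; the pool
        # keeps running in a degraded state instead of hanging.
        return ("faulted", None)
    finally:
        pool.shutdown(wait=False)


def healthy_read():
    return b"data"


def hung_read():
    time.sleep(1.0)  # simulates a driver call that never comes back
    return b"data"
```

With a wrapper like this, a healthy disk returns its data normally, and a wedged one comes back as `("faulted", None)` after the deadline - exactly the "fail one drive, not the server" behavior the pool should have.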
Not one of those conditions should be fatal. Good storage designs cope with them all, and good error handling at the ZFS layer is absolutely vital when projects like COMSTAR are introducing more and more types of storage device for ZFS to work with. Each extra type of storage brings yet more software into the equation and increases the risk of finding faults like this. They will be rare, but they should be expected, and ZFS should be designed to handle them.

-- This message posted from opensolaris.org
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss