>>>>> "sj" == Shawn Joy <shawn....@sun.com> writes:
sj> Can you explain, in simple terms, how ZFS now reacts
sj> to this?

I can't. :)  I think Victor's long message made a lot of sense.  The
failure modes with a SAN are not simple.  At the least there is the
difference of whether the target's write buffer was lost after a
transient failure or not, and the current storage stack assumes it's
never lost.

IMHO, SAN's are in general broken by design, because their software
stacks don't deal predictably with common network failure modes (like
the target rebooting but the initiator staying up).  The standard that
would qualify to me as ``dealing predictably'' is what NFS provides:

 * Writes are double-cached on client and server, so the client can
   replay them if the server crashes.  To my limited knowledge, no SAN
   stack does this.  Expensive SAN's can limit the amount of data at
   risk with NVRAM, but it seems like there would always be a little
   bit of data in flight.  A cost-conscious Solaris iSCSI target will
   put a quite large amount of data at risk between sync-cache
   commands.  This is okay, just as it's okay for NFS servers, but
   only if all the initiators reboot whenever the target reboots.

   Doing the client-side part of the double-caching is a little
   tricky, because I think you really want to do it pretty high in the
   storage stack, maybe in ZFS rather than in the initiator, or else
   you will be triple-caching a TXG (twice on the client, once on the
   server), which can be pretty big.  It means introducing the idea
   that a sync-cache command can fail, and that when it does,
   none/some/all of the writes between the last sync-cache that
   succeeded and the current one that failed may have been silently
   lost, even if those writes were ack'd as successful when they were
   issued.

 * The best current practice for NFS mounts is 'hard,intr', meaning:
   retry forever if there is a failure, and if you want to stop
   retrying, whatever app was doing the writing gets killed.  This
   rule means any database file that got ``intr'd'' will be
   crash-consistent.
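To make the first point concrete, here is what a client-side replay
window between sync-cache barriers might look like.  Everything below
is hypothetical---ToyTarget, ReplayingClient, and their methods are
made-up stand-ins, not any real initiator or target API---it only
illustrates the ``hold writes until a barrier succeeds, replay on
barrier failure'' idea:

```python
class SyncCacheFailed(Exception):
    """Raised when the target cannot confirm earlier writes are durable."""


class ToyTarget:
    """In-memory stand-in for a SAN target with a volatile write buffer."""

    def __init__(self):
        self.stable = {}        # data that survived a sync-cache
        self.buffer = {}        # volatile write cache; lost on reboot
        self.rebooted = False

    def write(self, lba, data):
        self.buffer[lba] = data      # ack'd, but not yet durable

    def sync_cache(self):
        if self.rebooted:            # buffer was lost in the reboot:
            self.rebooted = False    # report failure instead of lying
            return False
        self.stable.update(self.buffer)
        self.buffer.clear()
        return True

    def reboot(self):                # transient failure: cache evaporates
        self.buffer.clear()
        self.rebooted = True


class ReplayingClient:
    """Keeps a client-side copy of every write until a barrier succeeds."""

    def __init__(self, target):
        self.target = target
        self.pending = []       # writes since the last good sync-cache

    def write(self, lba, data):
        self.pending.append((lba, data))
        self.target.write(lba, data)

    def sync_cache(self):
        if not self.target.sync_cache():
            # Every ack since the last barrier is suspect: replay the
            # whole pending window, then demand a fresh barrier.
            for lba, data in self.pending:
                self.target.write(lba, data)
            if not self.target.sync_cache():
                raise SyncCacheFailed("writes since last barrier may be lost")
        self.pending.clear()    # durable now; safe to drop client copies
```

Note that `pending` here is exactly the triple-caching cost mentioned
above: a whole TXG's worth of data held on the client until the
barrier succeeds, which is why you'd rather do this once, high in the
stack.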
The SAN equivalent of 'intr' would be force-unmounting the filesystem
(and force-unmounting implies either killing processes with open files
or giving persistent errors on any open file handles).  I'm pretty
sure no SAN stack does this intentionally whenever it's needed---
rather, it just sort of happens sometimes, depending on how errors
percolate upward through various nested cargo-cult timeouts.  I guess
it would be easy to add to a first order---just make SAN targets stay
down forever after they bounce, until ZFS marks them offline.  The
tricky part is the complaints you get afterward: ``how do I add this
target back without rebooting?'', ``do I really have to resilver?
It's happening daily, so I'm basically always resilvering.'', ``we are
going down twice a day because of harmless SAN glitches that we never
noticed before---is this really necessary?''

I think I remember some post that made it sound like people were
afraid to touch any of the storage exception handling, because no one
knows what cases are really captured by the many stupid levels of
timeouts and retries.

In short, it sounds to me like the retry state machines of SAN
initiators are broken by design, across the board.  They make the same
assumption they did for local storage: that the only time data in a
target's write buffer will get lost is during a crash-reboot.  This is
wrong not only for SAN's but also for hot-pluggable drives, which can
have power sags that get wrongly treated the same way as CRC errors on
the data cable.  It's possible to get this right, the way NFS gets it
right, but instead the popular fix is to leave the storage stack
broken and make ZFS more resilient to this type of corruption, like
other filesystems are---because resilience is good, and because people
are always twitchy and frightened and not expecting strictly
consistent behavior from their SAN's anyway, so the problem looks
rare.
So far SAN targets have been proprietary, so vendors are free to
conceal this problem with protocol tweaks, expensive NVRAM's, and
undefended or fuzzy advice given through their support channels to
their paranoid, accepting sysadmins.  Whatever free and open targets
behaved differently were assumed to be ``immature.''  Hopefully, now
that SAN's are opening up, this SAN write hole will finally get
plugged somehow---maybe with one of the two * points above.  If we
were to pick the second *, then we'd probably need some notion of a
``target boot cookie'' so we only take the 'intr'-like force-unmount
path in the cases where it's really needed.

sj> Do we all agree that creating a zpool out of one device in a
sj> SAN environment is not recommended.

This is still a good question.  The stock response is ``ZFS needs to
manage at least one layer of <blah blah>'', but this problem (a SAN
target rebooting while the initiator does not) isn't unexplained
storage chaos or cosmic bit-flip gremlins.  Does anyone know whether
the degree to which zpool redundancy helps with this type of event has
changed before/after b77?
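For what it's worth, a ``target boot cookie'' could be as simple as a
random value the target regenerates on every boot, with the initiator
refusing to retry transparently once the cookie changes.  A
hypothetical sketch---BootCookieTarget and CookieCheckingInitiator are
invented names, not any real iSCSI feature:

```python
import os


class BootCookieTarget:
    """Toy target that picks a fresh random cookie on every boot."""

    def __init__(self):
        self.boot_cookie = os.urandom(8)

    def reboot(self):
        # A new boot means the old volatile write buffer is gone,
        # so a new cookie is generated.
        self.boot_cookie = os.urandom(8)


class CookieCheckingInitiator:
    """Remembers the cookie it first saw and compares it on reconnect."""

    def __init__(self, target):
        self.target = target
        self.expected = target.boot_cookie

    def reconnect(self):
        """Return True if transparently retrying is safe."""
        if self.target.boot_cookie != self.expected:
            # Target bounced: ack'd-but-unsynced writes may be gone.
            # Don't silently retry; take the 'intr'-like hard-error
            # (force-unmount / offline) path instead.
            return False
        return True   # same boot: plain network glitch, retry away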
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss