>>>>> "sj" == Shawn Joy <shawn....@sun.com> writes:

    sj> Can you explain in, simple terms, how ZFS now reacts
    sj> to this? 

I can't.  :) I think Victor's long message made a lot of sense.  The
failure modes with a SAN are not simple.  At a minimum there is the
question of whether or not the target's write buffer was lost during a
transient failure, and the current storage stack assumes it never is.

IMHO, SANs are broken by design in general, because their software
stacks don't deal predictably with common network failure modes (like
the target rebooting while the initiator stays up).  The standard that
would qualify to me as ``dealing predictably'' is what NFS provides:

 * writes are double-cached on client and server, so the client can
   replay them if the server crashes.  To my limited knowledge, no SAN
   stack does this.  Expensive SANs can limit the amount of data at
   risk with NVRAM, but it seems like there would always be a little
   bit of data in flight.

   A cost-conscious Solaris iSCSI target will put quite a large amount
   of data at risk between sync-cache commands.

   This is okay, just as it's okay for NFS servers, but only if all
   the initiators reboot whenever the target reboots.

   Doing the client-side part of the double-caching is a little tricky
   because I think you really want to do it pretty high in the storage
   stack, maybe in ZFS rather than in the initiator, or else you will
   be triple-caching a TXG (twice on the client, once on the server),
   which can be pretty big.  This means introducing the idea that a
   sync-cache command can fail, and that when it does, none/some/all
   of the writes between the last sync-cache that succeeded and the
   current one that failed may have been silently lost, even if those
   write commands were acked as successful when they were issued (see
   the sketch after this list).

 * the best current practice for NFS mounts is 'hard,intr', meaning:
   retry forever if there is a failure, and if you want to stop
   retrying, whatever app was doing the writing gets killed.  This
   rule means any database file that got ``intr'd'' will be
   crash-consistent.

   The SAN equivalent of 'intr' would be force-unmounting the
   filesystem (and force-unmounting implies either killing processes
   with open files or giving persistent errors to any open
   filehandles).  I'm pretty sure no SAN stack does this intentionally
   whenever it's needed---rather it just sort of happens sometimes
   depending on how errors percolate upwards through various
   nested cargo-cult timeouts.

   I guess it would be easy to add to first order---just make SAN
   targets stay down forever after they bounce, until ZFS marks them
   offline.  The tricky part is the complaints you get afterward: ``how do
   I add this target back without rebooting?'', ``do I really have to
   resilver?  It's happening daily so I'm basically always
   resilvering.'', ``we are going down twice a day because of harmless
   SAN glitches that we never noticed before---is this really
   necessary?''  I think I remember some post that made it sound like
   people were afraid to touch any of the storage exception handling
   because no one knows what cases are really captured by the many
   stupid levels of timeouts and retries.
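
To make the sync-cache-can-fail idea from the first point concrete,
here's a minimal sketch in Python.  None of it is real ZFS or iSCSI
code: 'target', its write(), and its flush() are invented stand-ins
for a device whose volatile write cache can vanish when it reboots.

    # Sketch only: keep a client-side copy of acked-but-unflushed writes so
    # a failed sync-cache becomes "replay and retry" instead of silent loss.
    class ReplayingInitiator:
        def __init__(self, target):
            self.target = target
            self.unstable = []        # acked to the caller, not yet known durable

        def write(self, offset, data):
            self.target.write(offset, data)       # may only reach the volatile cache
            self.unstable.append((offset, data))  # remember it until a flush succeeds

        def flush(self):
            try:
                self.target.flush()               # sync-cache: data is durable only now
            except IOError:
                # The target bounced: anything since the last good flush may be
                # gone even though the individual writes were acked.  Replay
                # from our copy instead of treating the acks as durability.
                for offset, data in self.unstable:
                    self.target.write(offset, data)
                self.target.flush()
            self.unstable = []                    # durable (or replayed); safe to drop

The point is only that the client holds enough state to replay;
presumably in ZFS that state would just be the TXG itself, which is
why doing it high in the stack avoids the triple-caching.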

In short, to me it sounds like the retry state machines of SAN
initiators are broken by design, across the board.  They make the same
assumption they make for local storage: the only time data in a
target's write buffer gets lost is during a crash-reboot.  This is
wrong not only for SANs but also for hot-pluggable drives, which can
have power sags that get wrongly treated the same way as CRC errors on
the data cable.  It's possible to get it right, like NFS is right, but
the popular fix instead is to leave the storage stack broken and make
ZFS more resilient to this type of corruption, like other filesystems
are, because resilience is good anyway, and because people are twitchy
and frightened and not expecting strictly consistent behavior around
their SANs, so the problem seems rare.

So far SAN targets have been proprietary, so vendors have been free to
conceal this problem with protocol tweaks, expensive NVRAMs, and
undefended or fuzzy advice given through their support channels to
their paranoid but accepting sysadmins.  Any free and open targets
that behaved differently were assumed to be ``immature.''  Hopefully,
now that SANs are opening up, this SAN write hole will finally get
plugged somehow,

...maybe with one of the two * points above, and if we were to pick
the second * then we'd probably need some notion of a ``target boot
cookie'' so we only take the 'intr'-like force-unmount path in the
cases where it's really needed.
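
Concretely, the boot cookie check I have in mind is something like the
sketch below.  The names are invented: get_boot_cookie() stands in for
whatever a target could cheaply expose, e.g. its boot time or a reboot
generation counter.

    # Sketch only: decide on reconnect whether this was a harmless path glitch
    # (same boot cookie, keep retrying) or a target reboot (cookie changed,
    # volatile write cache is gone, take the disruptive 'intr'-like path).
    class BootCookieCheck:
        def __init__(self, target):
            self.target = target
            self.cookie = target.get_boot_cookie()   # remembered at attach time

        def on_reconnect(self):
            if self.target.get_boot_cookie() != self.cookie:
                # Target rebooted behind our back: acked-but-unflushed writes
                # may be lost, so fault the vdev / force-unmount instead of
                # quietly resuming I/O.
                raise IOError("target rebooted; refusing to resume silently")
            # Same boot: only the path bounced, so retrying outstanding I/O
            # is safe and the retry-forever behavior can stay.

With that check, harmless path glitches keep today's retry behavior,
and only a genuine target reboot triggers the disruptive path.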

    sj> Do we all agree that creating a zpool out of one device in a
    sj> SAN environment is not recommended.

This is still a good question.  The stock response is ``ZFS needs to
manage at least one layer of <blah blah>'', but this problem (the SAN
target reboots while the initiator does not) isn't unexplained storage
chaos or cosmic bitflip gremlins.  Does anyone know whether the amount
that zpool redundancy helps with this type of event changed
before/after b77?
