>>>>> "gm" == Gary Mills <mi...@cc.umanitoba.ca> writes:

    gm> Is there any more that I've missed?

1. Filesystem/RAID layer dispatches writes 'aaaaaaaa' to the iSCSI
   initiator.  The iSCSI initiator accepts them, buffers them, and
   returns success to the RAID layer.

2. iSCSI initiator sends them to the iSCSI target.  The iSCSI target
   writes 'aaaaaaaa'.

3. Network connectivity is interrupted, target is rebooted, something like that.

4. Filesystem/RAID layer dispatches writes 'bbbbbbbb' to the iSCSI
   initiator.  The initiator accepts, buffers, and returns success.

5. iSCSI initiator can't write 'bbbbbbbb'.

6. iSCSI initiator goes through some cargo-cult error-recovery scheme:
   retry this 3 times, timeout, disconnect, reconnect, retry
   really-hard 5 times, timeout, return various errors to the RAID
   layer, maybe.

7. OH!  Target's back!  Good.

8. Filesystem/RAID layer writes 'cccccccc' to the iSCSI initiator.  Maybe
   it gets an error, maybe it flags the 'cccccccc' destination blocks bad,
   increments RAID-layer counters, tries to ``rewrite'' the
   'cccccccc', and eventually gets success back from the initiator.

9. Filesystem/RAID layer issues SYNCHRONIZE CACHE to the iSCSI
   initiator.

10. The initiator flushes 'cccccccc' to the target, and waits for the
    target to confirm that 'cccccccc' and all previous writes are on
    physical media.

11. The initiator returns success for the SYNCHRONIZE CACHE command.

12. Filesystem/RAID layer writes 'd' commit sector updating pointers,
    aiming various important things at 'bbbbbbbb'.

Now, the RAID layer thinks 'aaaaaaaa' and 'bbbbbbbb' and 'cccccccc'
and 'd' are all written, but in fact only 'aaaaaaaa' and 'cccccccc'
and 'd' are written, and 'd' points at garbage.
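
To make the failure mode concrete, here is a toy Python sketch of steps
1-12.  All class and method names are made up for illustration, not real
driver code: the initiator acknowledges buffered writes, quietly drops
them during its error-recovery dance, and still returns success for the
later SYNCHRONIZE CACHE.

    class ToyTarget:
        def __init__(self):
            self.media = {}                # block -> data actually on "disk"

    class ToyInitiator:
        def __init__(self, target):
            self.target = target
            self.buffer = {}               # acknowledged but not yet flushed
            self.connected = True

        def write(self, block, data):
            self.buffer[block] = data      # buffered; success returned at once
            return "OK"

        def disconnect(self):              # step 3: target unreachable
            self.connected = False

        def error_recovery(self):
            # Steps 5-7: retries time out, the buffered write is quietly
            # dropped, and the session comes back when the target reappears.
            self.buffer.clear()
            self.connected = True

        def synchronize_cache(self):
            # Steps 9-11: flush whatever is *currently* buffered, report OK.
            self.target.media.update(self.buffer)
            self.buffer.clear()
            return "OK"

    target = ToyTarget()
    init = ToyInitiator(target)

    init.write(0, "aaaaaaaa")              # step 1
    init.synchronize_cache()               # step 2: 'aaaaaaaa' reaches media
    init.disconnect()                      # step 3
    init.write(1, "bbbbbbbb")              # step 4: buffered, success returned
    init.error_recovery()                  # steps 5-7: 'bbbbbbbb' silently lost
    init.write(2, "cccccccc")              # step 8
    init.synchronize_cache()               # steps 9-11: returns success
    init.write(3, "d")                     # step 12: commit sector
    init.synchronize_cache()

    print(target.media)   # {0: 'aaaaaaaa', 2: 'cccccccc', 3: 'd'}, no 'bbbbbbbb'

The printout shows exactly the hole described above: 'bbbbbbbb' was
acknowledged but never reaches media, and nothing ever tells the RAID
layer so.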

NFS has a state machine designed to handle server reboots without
breaking any consistency promises.  Substitute ``the userland app''
for Filesystem/RAID, and ``NFSv3 client'' for iSCSI initiator.  The
NFSv3 client will keep track of which writes are actually committed to
disk and batch them into commit blocks of which the userland app is
entirely unaware.  The NFS client won't free a commit block from its
RAM write cache until it's on disk.  If the server reboots, it will
replay the open commit blocks.  If the server AND client reboot, the
commit block will be lost from RAM, but then 'd' is not written, so
the datastore is not corrupt.  The iSCSI initiator probably needs to
do something similar to NFSv3 to enforce that success from SYNCHRONIZE
CACHE really means what ZFS thinks it means.
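
Here's a rough sketch, with made-up names, of that commit-batch
bookkeeping (the real NFSv3 client uses a server-supplied write verifier
for the same job): the client keeps the open batch in RAM, compares
verifiers at COMMIT time, and replays the batch if the server rebooted
in between.

    class ToyServer:
        def __init__(self):
            self.verifier = 0      # changes on every "reboot"
            self.stable = {}       # data confirmed on stable storage
            self.unstable = {}     # UNSTABLE writes not yet committed

        def write_unstable(self, off, data):
            self.unstable[off] = data
            return self.verifier

        def commit(self):
            self.stable.update(self.unstable)
            self.unstable.clear()
            return self.verifier

        def reboot(self):
            self.unstable.clear()  # uncommitted data does not survive
            self.verifier += 1

    class ToyClient:
        def __init__(self, server):
            self.server = server
            self.batch = {}        # the open commit batch, kept in client RAM
            self.batch_verifier = server.verifier

        def write(self, off, data):
            # The app's write() returns immediately; the data joins the batch.
            self.batch[off] = data
            self.batch_verifier = self.server.write_unstable(off, data)

        def commit(self):
            v = self.server.commit()
            if v != self.batch_verifier:
                # Verifier changed: the server rebooted and may have lost
                # the unstable data, so replay the batch and commit again.
                for off, data in self.batch.items():
                    self.server.write_unstable(off, data)
                self.batch_verifier = self.server.commit()
            self.batch.clear()     # only now is the RAM copy freed

    srv = ToyServer()
    cli = ToyClient(srv)
    cli.write(0, "aaaaaaaa")
    srv.reboot()                   # unstable 'aaaaaaaa' is lost on the server
    cli.commit()                   # mismatch detected, batch replayed
    print(srv.stable)              # {0: 'aaaaaaaa'}

The userland app never sees any of this; it just gets its write() back.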

It's a little trickier to do this with ZFS/iSCSI because the NFS
cop-out was to use 'hard' mounts: you _never_ propagate write
failures up the stack.  You just freeze the application until you can
finally complete the write, and if you can't write you evade the
consistency guarantees by killing the app.  Then, it's a solvable
problem to design apps that won't corrupt their datastores when
they're killed, so the overall system works.  This world order won't
work analogously for ZFS-on-iSCSI, which needs to see failures to
handle redundancy.

We may even need some new kind of failure code to solve the problem,
but maybe something clever can be crammed into the old API.  Imagine
the stream of writes to a disk as a bucket-brigade separated by
SYNCHRONIZE CACHE commands.  The writes within each bucket can be
sloshed around (reordered) arbitrarily.  And if the machine crashes,
we might pour _part_ of the water in the last bucket on the fire, but
then we stop and drop all the other buckets.  So far, we can handle
it.
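
As a sketch (illustrative only, nothing ZFS-specific), the legal crash
states under that model look like this: every bucket before the last
completed SYNCHRONIZE CACHE is fully on media, the in-flight bucket
lands as an arbitrarily reordered subset, and everything after it is
dropped.

    import random

    def one_crash_state(buckets):
        """One media state a crash may legally leave behind."""
        synced = random.randrange(len(buckets) + 1)   # buckets fully flushed
        media = [w for bucket in buckets[:synced] for w in bucket]
        if synced < len(buckets):
            inflight = buckets[synced]
            # Any reordered subset of the in-flight bucket may have landed.
            media += random.sample(inflight, random.randrange(len(inflight) + 1))
        return media                                  # later buckets: dropped

    # Buckets are separated by SYNCHRONIZE CACHE barriers.
    buckets = [["a1", "a2", "a3"], ["b1", "b2"], ["d"]]
    print(one_crash_state(buckets))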

But we've no way to handle the situation where someone in the _middle_
of the brigade spills the water in his bucket.  There's no way to
cleanly restart the brigade after this happens.  ZFS needs to
gracefully handle a SYNCHRONIZE CACHE command that returns _failure_,
and needs to interpret such a failure really aggressively, as in:

  Any writes you issued since the last SYNCHRONIZE CACHE, *even if you
  got a Success return to your block-layer write() command*, may or
  may not be committed to disk, and waiting will NOT change the
  situation---they're just gone.  But, the disk is still here, and is
  working, meh, ~fine.

  This failure is not ``retryable''.  If you issue a second
  SYNCHRONIZE CACHE command, and it Succeeds, that does NOT change
  what I've just told you.  That Success only refers to writes issued
  between this failing SYNCHRONIZE CACHE command and the next one.
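
A hypothetical bit of bookkeeping for that interpretation might look
like the following (not ZFS code, just the shape of the rule):
everything issued since the last successful SYNCHRONIZE CACHE becomes
suspect on a failure, and a later success never rehabilitates it.

    class SyncWindowTracker:
        def __init__(self):
            self.pending = []    # writes since the last successful sync
            self.suspect = []    # writes a failed sync left indeterminate

        def note_write(self, block):
            self.pending.append(block)

        def sync_cache(self, ok):
            if ok:
                # Success covers only the writes issued since the previous
                # SYNCHRONIZE CACHE; anything already suspect stays suspect.
                self.pending.clear()
            else:
                # Failure: these writes may or may not be on media, and
                # waiting or retrying the sync will not change that.
                self.suspect.extend(self.pending)
                self.pending.clear()
            return list(self.suspect)

    t = SyncWindowTracker()
    t.note_write("bbbbbbbb")
    print(t.sync_cache(ok=False))  # ['bbbbbbbb'] must be resolved/rewritten
    t.note_write("cccccccc")
    print(t.sync_cache(ok=True))   # still ['bbbbbbbb']: later success no help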

Once the iSCSI initiator is fixed, we probably need to go back and add
NFS-style commit batches to even the SATA disk drivers, which can suffer
the same problem if you hot-swap them, or maybe even if you don't
hot-swap but the disk reports some error which invokes some convoluted
sd/ssd exception handling involving ``resets''.  The assumption that
write, write, write, SYNCHRONIZE CACHE promises all those writes are
on-disk once SYNCHRONIZE CACHE returns simply doesn't hold.  The only
way to make it hold is to promise to panic the kernel whenever any
disk, controller, bus, or iSCSI session is ``reset''.  The simple,
obvious ``SYNCHRONIZE CACHE is the final word of God'' assumption ought
to handle cord-yanking just fine, but not smaller failures.
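
For what that might look like at the driver level, here is a
hypothetical sketch (none of this corresponds to real sd/ssd code): the
driver retains every write since the last confirmed flush and replays
that window after a reset, so SYNCHRONIZE CACHE can keep its promise
without panicking the kernel.

    class FakeDevice:
        def __init__(self):
            self.cache, self.media = {}, {}
        def write(self, block, data):
            self.cache[block] = data       # lands in the volatile write cache
        def flush(self):
            self.media.update(self.cache)  # cache contents reach the platters
            self.cache.clear()
        def reset(self):
            self.cache.clear()             # volatile cache contents are lost

    class BatchedDiskDriver:
        def __init__(self, device):
            self.device = device
            self.window = []               # writes since last confirmed flush

        def write(self, block, data):
            self.window.append((block, data))
            self.device.write(block, data)

        def on_reset(self):
            # Device/bus/session reset: cached writes may have been dropped,
            # so replay the whole open window once the device is back.
            for block, data in self.window:
                self.device.write(block, data)

        def synchronize_cache(self):
            self.device.flush()
            self.window.clear()            # only a confirmed flush closes it
            return "OK"

    dev = FakeDevice()
    drv = BatchedDiskDriver(dev)
    drv.write(0, "bbbbbbbb")
    dev.reset()                            # hot-swap, convoluted sd/ssd reset
    drv.on_reset()                         # replay the open window
    drv.synchronize_cache()
    print(dev.media)                       # {0: 'bbbbbbbb'}: the promise holds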
