>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>>>> "pf" == Paul Fisher <[EMAIL PROTECTED]> writes:

    re> I was able to reproduce this in b93, but might have a
    re> different interpretation

You weren't able to reproduce the hang of 'zpool status'?
Your 'zpool status' was after the FMA fault kicked in, though.  How
about before FMA decided to mark the pool faulted---did 'zpool status'
hang, or work?  If it worked, what did it report?

The 'zpool status' hanging happens for me on b71 when an iSCSI target
goes away.  (IIRC 'iscsiadm remove discovery-address ...' unwedges
zpool status for me, but my notes could be more careful.)

    re> However, the default failmode property is set to "wait" which
    re> will patiently wait forever.  If you would rather have the I/O
    re> fail, then you should change the failmode to "continue"

For him, it sounds like it's doing neither.  I think he does not
have the failmode property, since it is so new?

It sounds like 'continue' should return I/O errors sooner than 9
minutes after the unredundant disks generate them (but not at all for
degraded redundant pools of course).  And it sounds like 'wait' should
block the writing program, forever if necessary, like an NFS hard
mount.

  (1) Is the latter what 'wait' actually did for you?  Or did the
      writing process get I/O errors after the 9-minutes-later FMA
      diagnosis?

  (2) Is it like NFS 'hard' or is it like 'hard,intr'? :)

It's great to see these things improving.

    pf> Wow! Who knew that 17, 951 was the magic number...  Seriously,
    pf> this does seem like an "excessive amount of certainty".

I agree it's an awfully forgiving constant, so big that it sounds like
it might not be a constant manually set to 16384 or something, but
rather an accident.  I'm surprised to find FMA is responsible for
deciding the length of this 9-minute (or more, for Ross) delay.

Note that, if the false positives one is trying to filter out are
things like USB/SAN cabling spasms and drive recalibrations, the right
metric is time, not the number of failed CDBs.
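
To make the time-vs-count point concrete, here is a toy sketch in
Python.  Nothing in it is FMA's actual logic: the class names, the
limit of 100 failures and the 30-second window are all made up for
illustration.  It only shows that a count threshold trips sooner or
later depending on how fast the initiator retries, while a duration
threshold does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CountDetector:
    """Hypothetical: faults after `limit` consecutive failed commands."""
    limit: int
    failures: int = 0

    def record(self, now: float, ok: bool) -> bool:
        # Any success resets the run of failures.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.limit


@dataclass
class DurationDetector:
    """Hypothetical: faults once failures have lasted `window` seconds."""
    window: float
    first_failure: Optional[float] = None

    def record(self, now: float, ok: bool) -> bool:
        if ok:
            # The device answered, so the outage (if any) is over.
            self.first_failure = None
            return False
        if self.first_failure is None:
            self.first_failure = now
        return now - self.first_failure >= self.window


def time_to_fault(detector, rate_hz: float, horizon: float = 3600.0):
    """Feed nothing but failures at rate_hz; return seconds until a fault."""
    step = 1.0 / rate_hz
    n = 0
    while n * step <= horizon:
        now = n * step
        if detector.record(now, ok=False):
            return now
        n += 1
    return None  # never faulted within the horizon
```

With these made-up numbers, the count detector faults after 99
seconds when commands fail once per second but after 24.75 seconds
when they fail four times per second; the duration detector faults
at 30 seconds either way, and a single success in the middle of a
cabling spasm resets it.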

The hugely-delayed response may be a blessing in disguise though,
because arranging for the different FMA states to each last tens of
minutes means it's possible to evaluate the system's behavior in each
state, to see if it's correct.  For example, within this 9-minute
window:

 * what does 'zpool status' say before the FMA faulting?

 * what do applications experience, e.g.:

   + is it possible to get an I/O error during this window with
     failmode=wait?  how about with failmode=continue?

   + are reads and writes that block interruptible or uninterruptible?

   + What about fsync()?  

     o what about fsync() if there is a slog?

 * is the system stable or are there ``lazy panic'' cases?

   + what if you ``ask for it'' by calling 'zpool clear' or 'zpool
     scrub' within the 9-minute window?

 * are other pools that don't include failed devices affected?  (I
   mean for reading/writing, but if 'zpool status' is frozen for all
   pools, then other pools are affected in that sense too.)

 * probably other stuff...
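
For the 'zpool status' item, one rough way to take notes more
carefully than I did is to run the command under a wall-clock limit
and record whether it returned or hung.  A minimal Python sketch
follows; probe() is my own hypothetical helper, not any ZFS tool.
It deliberately does not wait for the child after killing it,
because a process stuck in an uninterruptible kernel sleep may
never die, and then the prober would hang too.

```python
import subprocess
import time


def probe(cmd, timeout):
    """Run cmd under a time limit.

    Returns (state, seconds, output), where state is 'ok' (exit 0),
    'error' (non-zero exit) or 'hung' (no exit within `timeout`).
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: do NOT wait() here, since a command blocked
        # uninterruptibly in the kernel might never exit.
        proc.kill()
        return "hung", time.monotonic() - start, ""
    state = "ok" if proc.returncode == 0 else "error"
    return state, time.monotonic() - start, out
```

Something like probe(["zpool", "status", "tank"], timeout=10), with
"tank" standing in for the real pool name, run in a loop with
timestamps from before the target goes away until after the FMA
diagnosis, would answer the first question in the list.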

God willing some day some of the states can be shortened to values
more like 1 second or 1 minute, or really aggressive
variance-and-average-based thresholds like TCP timers, so that FMA is
actually useful rather than a step backwards from SVM as it seems to
me right now.  The NetApp paper Richard posted earlier was saying
NetApp never waits the 30 seconds for an ATAPI error, they just ignore
the disk if it doesn't answer within 1000ms or so.  But my crappy
Linux iSCSI targets would probably miss 1000ms timeouts all the time
just because they're heavily loaded---you could get pools that go
FAULTED whenever they get heavy use.
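
By "like TCP timers" I mean something along the lines of the TCP
retransmission timeout estimator (smoothed average plus a multiple
of the smoothed deviation, as in RFC 6298).  Here is a sketch in
Python of how that might look for per-device command latency.  To
be clear, this is not anything FMA or ZFS does today as far as I
know; the class name, the 30-second initial value and the 50ms
floor are assumptions for illustration.

```python
class AdaptiveTimeout:
    """TCP-style timeout: smoothed latency plus k * smoothed deviation."""

    def __init__(self, initial=30.0, floor=0.05, alpha=0.125, beta=0.25, k=4.0):
        self.initial = initial  # used until the first sample (assumed value)
        self.floor = floor      # never time out faster than this (assumed value)
        self.alpha = alpha      # gain for the smoothed average
        self.beta = beta        # gain for the smoothed deviation
        self.k = k              # how many deviations of headroom to allow
        self.srtt = None        # smoothed latency, in seconds
        self.rttvar = None      # smoothed mean deviation, in seconds

    def observe(self, latency):
        """Fold one completed command's latency into the estimate."""
        if self.srtt is None:
            self.srtt = latency
            self.rttvar = latency / 2
        else:
            # Deviation is updated first, using the old average.
            self.rttvar = ((1 - self.beta) * self.rttvar
                           + self.beta * abs(self.srtt - latency))
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * latency

    def timeout(self):
        """Current timeout to apply to the next command."""
        if self.srtt is None:
            return self.initial
        return max(self.floor, self.srtt + self.k * self.rttvar)
```

A device that always answers in about 10ms ends up pinned at the
floor, while a loaded target whose latency swings between 200ms and
1.5s keeps a timeout of a few seconds, so it would not be ignored
merely for being busy.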

So some of FMA's states maybe should be short, but they're harder to
observe when they're so short.  The point of FMA, AIUI, is to make the
failure state machine really complicated.  We want it complicated so it
can deal with both NetApp's good example of aggressive timers and my
crappy Linux IET setup, with increasingly hairy rules written as
experience accumulates.  Complicated means that observing each state
is important for verifying the system's correctness.  And observing
means the states can't be 1 second long even if that's the appropriate
length.  But I don't know if that's really the developers' intent, or
just my dreaming and hoping.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
