>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>>>> "pf" == Paul Fisher <[EMAIL PROTECTED]> writes:

    re> I was able to reproduce this in b93, but might have a
    re> different interpretation

You weren't able to reproduce the hang of 'zpool status'? Your
'zpool status' was after the FMA fault kicked in, though. How about
before FMA decided to mark the pool faulted---did 'zpool status'
hang, or work? If it worked, what did it report?

The 'zpool status' hanging happens for me on b71 when an iSCSI target
goes away. (IIRC 'iscsiadm remove discovery-address ...' unwedges
zpool status for me, but my notes could be more careful.)

    re> However, the default failmode property is set to "wait" which
    re> will patiently wait forever. If you would rather have the I/O
    re> fail, then you should change the failmode to "continue"

For him, it sounds like it's not doing either. I think he does not
have the failmode property, since it is so new?

It sounds like 'continue' should return I/O errors sooner than 9
minutes after the unredundant disks generate them (but not at all for
degraded redundant pools, of course). And it sounds like 'wait' should
block the writing program, forever if necessary, like an NFS hard
mount.

(1) Is the latter what 'wait' actually did for you? Or did the
    writing process get I/O errors after the 9-minutes-later FMA
    diagnosis?

(2) Is it like NFS 'hard' or is it like 'hard,intr'? :)

It's great to see these things improving.

    pf> Wow! Who knew that 17,951 was the magic number... Seriously,
    pf> this does seem like an "excessive amount of certainty".

I agree it's an awfully forgiving constant, so big that it sounds
like it might not be a constant manually set to 16384 or something,
but rather an accident. I'm surprised to find FMA is responsible for
deciding the length of this 9-minute (or more, for Ross) delay.

Note that, if the false positives one is trying to filter out are
things like USB/SAN cabling spasms and drive recalibrations, the
right metric is time, not number of failed CDBs.
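
To make the time-versus-count point concrete, here is a minimal Python sketch. It is not anything FMA actually does: the class names, the 60-second grace period, and the two simulated error streams are all hypothetical, and only the 17,951 figure comes from the thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CountFilter:
    """Fault after `max_errors` consecutive failed commands (count metric)."""
    max_errors: int
    errors: int = 0

    def record_success(self, now: float) -> None:
        # A successful command resets the consecutive-error count.
        self.errors = 0

    def record_error(self, now: float) -> bool:
        self.errors += 1
        return self.errors >= self.max_errors


@dataclass
class TimeFilter:
    """Fault once errors have persisted for `grace_seconds` (time metric)."""
    grace_seconds: float
    first_error_at: Optional[float] = None

    def record_success(self, now: float) -> None:
        # A successful command ends the current outage.
        self.first_error_at = None

    def record_error(self, now: float) -> bool:
        if self.first_error_at is None:
            self.first_error_at = now
        return now - self.first_error_at >= self.grace_seconds


def faulted(filt, error_times: Iterable[float]) -> bool:
    """Feed error timestamps (in seconds); True if the filter ever trips."""
    return any(filt.record_error(t) for t in error_times)


# Hypothetical streams: a 2-second cabling spasm that fails 20,000
# commands, and a dead device that fails one command every 30 seconds
# for 10 minutes.
spasm = [i * 0.0001 for i in range(20000)]
dead_disk = [i * 30.0 for i in range(21)]

print(faulted(CountFilter(17951), spasm), faulted(TimeFilter(60.0), spasm))
print(faulted(CountFilter(17951), dead_disk), faulted(TimeFilter(60.0), dead_disk))
```

With these made-up inputs the count filter faults on the brief spasm and never notices the slow, dead device, while the time filter does the opposite, which is the behaviour argued for above.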

The hugely-delayed response may be a blessing in disguise though,
because arranging for the different FMA states to each last tens of
minutes means it's possible to evaluate the system's behavior in each
state, to see if it's correct. For example, within this 9-minute
window:

 * what does 'zpool status' say before the FMA faulting?

 * what do applications experience, e.g.,

   + is it possible to get an I/O error during this window with
     failmode=wait? how about with failmode=continue?

   + are reads and writes that block interruptible or
     uninterruptible?

   + what about fsync()?

     o what about fsync() if there is a slog?

 * is the system stable or are there ``lazy panic'' cases?

   + what if you ``ask for it'' by calling 'zpool clear' or 'zpool
     scrub' within the 9-minute window?

 * are other pools that don't include failed devices affected (for
   reading/writing. but, also, if 'zpool status' is frozen for all
   pools, then other pools are affected.)

 * probably other stuff...

God willing, some day some of the states can be shortened to values
more like 1 second or 1 minute, or to really aggressive
variance-and-average-based thresholds like TCP timers, so that FMA is
actually useful rather than a step backwards from SVM, as it seems to
me right now.

The NetApp paper Richard posted earlier was saying NetApp never waits
the 30 seconds for an ATAPI error; they just ignore the disk if it
doesn't answer within 1000ms or so. But my crappy Linux iSCSI targets
would probably miss 1000ms timeouts all the time just because they're
heavily loaded---you could get pools that go FAULTED whenever they
get heavy use. So some of FMA's states maybe should be short, but
they're harder to observe when they're so short.

The point of FMA, AIUI, is to make the failure state machine really
complicated. We want it complicated to deal with both NetApp's good
example of aggressive timers and my crappy Linux IET setup, so that
increasingly hairy rules can be written with experience. Complicated
means that observing each state is important to verify the
complicated system's correctness. And observing means the states
can't be 1 second long even if that's the appropriate length. But I
don't know if that's really the developers' intent, or just my
dreaming and hoping.
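
For the "variance-and-average-based thresholds like TCP timers" idea, here is a minimal Python sketch, assuming the estimator meant is the one TCP uses for its retransmission timer (smoothed latency plus four times its mean deviation, with the gains from RFC 6298). The class name, the defaults, and the sample latencies are hypothetical; this is not an FMA or ZFS interface.

```python
class AdaptiveTimeout:
    """Per-device I/O timeout: smoothed latency + 4 * mean deviation."""

    ALPHA = 1 / 8  # gain for the smoothed latency (RFC 6298)
    BETA = 1 / 4   # gain for the mean deviation (RFC 6298)

    def __init__(self, initial: float = 1.0, floor: float = 0.0):
        self.initial = initial  # timeout used before any sample is seen
        self.floor = floor      # never return a timeout below this
        self.srtt = None        # smoothed latency, seconds
        self.rttvar = None      # smoothed mean deviation, seconds

    def observe(self, latency: float) -> None:
        """Fold one completed I/O's latency (seconds) into the estimate."""
        if self.srtt is None:
            self.srtt = latency
            self.rttvar = latency / 2
            return
        # Update the deviation against the old average, then the average.
        self.rttvar = ((1 - self.BETA) * self.rttvar
                       + self.BETA * abs(self.srtt - latency))
        self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * latency

    def timeout(self) -> float:
        if self.srtt is None:
            return self.initial
        return max(self.floor, self.srtt + 4 * self.rttvar)


# Hypothetical devices: a local disk answering in a steady 10 ms, and a
# heavily loaded iSCSI target alternating between 0.2 s and 1.5 s.
fast = AdaptiveTimeout()
for _ in range(100):
    fast.observe(0.010)

loaded = AdaptiveTimeout()
for i in range(100):
    loaded.observe(0.2 if i % 2 == 0 else 1.5)

print(fast.timeout(), loaded.timeout())
```

With these made-up samples the steady disk ends up with a timeout far below a fixed 1000ms cutoff, while the loaded target's timeout settles above its slowest normal response, so heavy use alone would not trip it.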
_______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss