>>>>> "es" == Eric Schrock <[EMAIL PROTECTED]> writes:
es> Are you running your experiments on build 101 or later?

no.  aside from that quick one for copies=2, I'm pretty bad about running well-designed experiments, and I do have old builds.  I need to buy more hardware.

It's hard to know how to get the most stable system.  I bet it'll be a year before this b101 stuff makes it into stable Solaris, yet the bleeding-edge improvements are all stability-related, so for mostly-ZFS jobs maybe it's better to run SXCE than sol10 in production.  I suppose I should be happy about that, since it means more people will have some source. :)

es> P.S. I'm also not sure that B_FAILFAST behaves in the way you
es> think it does.  My reading of sd.c seems to imply that much of
es> what you suggest is actually how it currently behaves,

Yeah, I got a private email referring me to the spec for PSARC/2002/126, which already included both pieces I hoped for (killing queued CDBs, and statefully tracking each device as failed/good), so I take back what I said about B_FAILFAST being useless---it should be able to help the ZFS availability problems we've seen.

The PSARC case says B_FAILFAST is implemented in the ``disk driver'', which AIUI is above the controller, just as I hoped.  But there is more than one ``disk driver'', so the B_FAILFAST logic is not factored out into one spot the way a vdev-level scheme would be; instead it's punted downwards and copy-pasted into sd, ssd, dad, ..., so whatever experience you get with it isn't necessarily portable to disks with a different kind of attachment.

I still think vdev-layer logic could make better decisions by using more than the one bit of information per device, but maybe 1-bit B_FAILFAST is enough to make me accept the shortfall as an arguable feature rather than a unanimous bug.  Also, if it can fix my (1) and (2) with FMA, then maybe the gap between B_FAILFAST and real NetApp-like drive diagnosis can be closed partly in userspace, the way developers seem to want.
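To make the ``one bit'' point concrete: everything a caller above the disk driver gets to say is a single flag on the buf(9S) it hands down to sd/ssd/dad, which then applies its own fail-fast policy.  Below is only a hedged sketch of what such a caller might look like---the function name, arguments, and error handling are mine, not actual ZFS source---though the flag and the DDI routines themselves are real.

    /*
     * Hedged sketch only, not actual ZFS source.  The whole "interface"
     * is one bit: set B_FAILFAST on the buf(9S) and let the disk driver
     * (sd/ssd/dad/...) decide how aggressively to give up.  The function
     * name and its arguments are hypothetical.
     */
    #include <sys/buf.h>
    #include <sys/kmem.h>
    #include <sys/sunldi.h>

    static int
    vdev_failfast_read(ldi_handle_t lh, void *data, size_t size, diskaddr_t lbn)
    {
            struct buf *bp = getrbuf(KM_SLEEP);
            int err;

            bp->b_flags = B_READ | B_BUSY | B_FAILFAST;   /* the single bit */
            bp->b_bcount = size;
            bp->b_un.b_addr = data;
            bp->b_lblkno = lbn;

            err = ldi_strategy(lh, bp);   /* hand off to sd/ssd/dad */
            if (err == 0)
                    err = biowait(bp);    /* driver may fail this one fast */

            freerbuf(bp);
            return (err);
    }

Anything smarter than that one bit---latency history, per-path state, NetApp-style drive diagnosis---has to live either inside each driver or up in FMA/userspace, which is the factoring complaint above.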
The problems this doesn't cover are write-related:

 * What should we do about implicit and explicit fsync()s where all the data is already on stable storage, but not with full redundancy---one device won't finish writing?  I think there should not be transparent recovery from this, though maybe others disagree.  But the pool-level failmode property doesn't settle the issue:

   (a) _when_ will you take the failure action (if failmode != wait)?  The property says *what* to do, not *when* to do it.

   (b) There isn't any vdev-level failure, only device-level, so it's not appropriate to consult the failmode property in the first place---the situation is different.  The question is, do we keep trying, or do we transition the device to FAULTED and the vdev to DEGRADED so that fsync()s can proceed without that device and hot-spare resilvering kicks in?  (There's a toy sketch of the second choice at the end of this mail.)

   (c) Inside the time interval between when the device starts writing slowly and when you take the (b) action, how well can you isolate the failure?  For example, can you ensure that read-only access remains instantaneous, even though atime updates involve writing, even though these 5-second txg flushes are blocked, and even though the admin might (gasp!) type 'zpool status'---or even a label-writing command like 'zpool attach'?  Or will one of those three things cause a pool-wide or ZFS-wide hang that blocks read access which could theoretically still work?

 * Commands like zpool attach, detach, replace, offline, export:

   (a) should not be uninterruptibly hangable.

   (b) Problems in one pool should not spill over into another.

   (c) And finally, they should be forcible even when they can't write everything they'd like to, so that rebooting isn't a necessary move in certain kinds of failure-recovery pool gymnastics.

I expect there's some quiet work on this in b101 also---at least someone said 'zpool status' isn't supposed to hang anymore?  So I'll have to try it out, but B_FAILFAST isn't enough to settle the whole issue, even modulo the marginal performance improvement that more ambitiously wacky schemes might promise us.
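Just to make the (b) question above concrete, here's a throwaway userland toy---every name and number in it is made up; it is not ZFS code and not a proposal for actual thresholds.  The idea is the vdev layer tracking how long each child's oldest outstanding write has been pending, and once one child blows a deadline while its siblings are done, faulting that child so the sync can complete degraded instead of hanging the pool:

    /* Hypothetical toy, not ZFS code: per-child write-latency policy. */
    #include <stdio.h>
    #include <stdbool.h>

    enum child_state { HEALTHY, FAULTED };

    struct child {
            const char *name;
            double oldest_pending_sec;   /* age of oldest unfinished write, 0 = all done */
            enum child_state state;
    };

    /* Returns true once the sync can be declared complete (possibly degraded). */
    static bool
    sync_can_complete(struct child *c, int n, double deadline_sec)
    {
            int done = 0, faulted = 0;

            for (int i = 0; i < n; i++) {
                    if (c[i].state == FAULTED) {
                            faulted++;
                    } else if (c[i].oldest_pending_sec == 0) {
                            done++;
                    } else if (c[i].oldest_pending_sec > deadline_sec) {
                            /* the (b) decision: stop waiting, degrade instead */
                            c[i].state = FAULTED;
                            faulted++;
                            printf("faulting %s; sync proceeds degraded\n", c[i].name);
                    }
            }
            /* need at least one finished copy and nobody still pending */
            return (done >= 1 && done + faulted == n);
    }

    int
    main(void)
    {
            struct child mirror[2] = {
                    { "c0t0d0", 0.0, HEALTHY },     /* finished its writes */
                    { "c0t1d0", 9.5, HEALTHY },     /* stuck for 9.5 seconds */
            };

            printf("sync complete: %s\n",
                sync_can_complete(mirror, 2, 5.0) ? "yes" : "no");
            return (0);
    }

The point of putting the decision at the vdev layer, rather than per-I/O inside each disk driver, is that it can see all the siblings at once---which is exactly the ``more than one bit per device'' argument.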