>>>>> "es" == Eric Schrock <[EMAIL PROTECTED]> writes:
es> Finally, imposing additional timeouts in ZFS is a bad idea.
es> [...] As such, it doesn't have the necessary context to know
es> what constitutes a reasonable timeout.

You're right in terms of fixed timeouts, but there's no reason ZFS
can't compare the performance of redundant data sources: if one vdev
performs an order of magnitude slower than another set of vdevs with
sufficient redundancy, stop issuing reads to the underperformer except
for scrubs/healing (keep issuing writes), and pass an event to FMA.
ZFS can also compare a drive's performance to its own history, and do
the same if the performance suddenly drops.

The former case eliminates the need for the mirror read policies in
SVM, which Ian requested a few hours ago for the situation where half
the mirror is a slow iSCSI target kept for geographic redundancy and
the other half is faster and local.  Some care would have to be taken
for targets shared between ZFS and some other initiator, but I'm not
sure that care would really be so difficult to take, or that the
oscillations induced by failing to take it would be particularly
harmful compared to unsupervised contention for a device.  The latter
case quickly notices drives that have been pulled, or, for Richard's
``overwhelmingly dominant'' case, drives stalled for 30 seconds
pending their report of an unrecovered read.

Developing meaningful performance statistics for drives, and a tool
for displaying them, would be useful in itself, not just for stopping
freezes and keeping a failing drive from degrading performance a
thousandfold.  Issuing reads to redundant devices is cheap compared to
freezing.  The policy for doing so is highly tunable and should be fun
to tune and watch, and the consequence of the policy making the wrong
choice isn't terribly dire.  (A rough sketch of the kind of policy I
mean follows below.)

The B_FAILFAST architecture captures this situation really poorly.
First, it isn't implementable in any serious way with near-line
drives, or really with any drives whose firmware you're not intimately
familiar with and in control of, and perhaps not with any drives,
period.  I suspect in practice it's more of a controller-level
feature, about whether or not you'd like to distrust the device's
error report and start resetting busses and channels and mucking
everything up trying to recover from some kind of ``weirdness''.  It's
not an answer to the known problem of drives stalling for 30 seconds
when they start to fail.  First and a half, when it's not implemented,
the system degrades to doubling your timeout pointlessly.

A driver-level block cache of UNCs would probably do more for this
speed/read-aggressiveness tradeoff than the whole B_FAILFAST
architecture: just cache the sectors known to return unrecoverable
reads, and refuse to issue further I/O for them until a timeout of
3 - 10 minutes passes.  I bet this would speed up most failures
tremendously, and without burdening upper layers with retry logic.
(That idea is sketched at the end of this message.)

Second, B_FAILFAST entertains the fantasy that I/Os are independent,
while what happens in practice is that the drive hits a UNC on one I/O
and then won't entertain any further I/Os, no matter what flags the
requests carry or how many times you ``reset'' things.  Maybe you
could try to rescue B_FAILFAST by putting clever statistics into the
driver to compare the drive's performance to its recent past, as I
suggested ZFS do, and admit no B_FAILFAST requests to the queues of
drives that have suddenly slowed down: just fail them immediately
without even trying.
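To make the read-steering idea concrete, here's a toy user-level
sketch of the sort of policy I mean: keep a per-child moving average
of read latency within a mirror, and divert ordinary reads away from a
child that's an order of magnitude slower than its best sibling.  This
is not ZFS code; every name in it (child_stats, record_read_latency(),
SLOW_FACTOR, ...) is invented for illustration.

/*
 * Toy sketch (not ZFS code) of the read-steering policy described
 * above: track an exponentially weighted moving average of per-child
 * read latency in a mirror, and skip a child for normal reads when it
 * is an order of magnitude slower than the best sibling.
 */
#include <stdio.h>
#include <stdbool.h>

#define MIRROR_CHILDREN  2
#define SLOW_FACTOR      10.0   /* "order of magnitude" threshold */
#define EWMA_ALPHA       0.1    /* weight given to the newest sample */

struct child_stats {
        double ewma_ms;         /* smoothed read latency, milliseconds */
};

static struct child_stats children[MIRROR_CHILDREN];

/* Fold a completed read's latency into the child's moving average. */
static void
record_read_latency(int child, double latency_ms)
{
        struct child_stats *cs = &children[child];

        if (cs->ewma_ms == 0.0)
                cs->ewma_ms = latency_ms;
        else
                cs->ewma_ms = EWMA_ALPHA * latency_ms +
                    (1.0 - EWMA_ALPHA) * cs->ewma_ms;
}

/*
 * Decide whether a normal (non-scrub) read should go to this child.
 * Scrub/resilver reads and all writes would bypass this check, and a
 * real implementation would also raise an FMA-style event the first
 * time a child is demoted, rather than silently skipping it.
 */
static bool
child_ok_for_reads(int child)
{
        double best = children[0].ewma_ms;

        for (int i = 1; i < MIRROR_CHILDREN; i++)
                if (children[i].ewma_ms < best)
                        best = children[i].ewma_ms;

        return (children[child].ewma_ms <= best * SLOW_FACTOR);
}

int
main(void)
{
        /* Child 0 is a fast local disk; child 1 is a slow iSCSI target. */
        for (int i = 0; i < 100; i++) {
                record_read_latency(0, 5.0);    /* ~5 ms reads  */
                record_read_latency(1, 80.0);   /* ~80 ms reads */
        }

        for (int c = 0; c < MIRROR_CHILDREN; c++)
                printf("child %d: ewma %.1f ms, reads %s\n", c,
                    children[c].ewma_ms,
                    child_ok_for_reads(c) ? "allowed" : "diverted");
        return (0);
}

Run standalone, the fast local child keeps taking reads and the slow
iSCSI-ish child gets diverted once its average settles; SLOW_FACTOR
and EWMA_ALPHA are exactly the sort of knobs I'd expect people to
enjoy fiddling with.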
I submit that this queueing and statistics collection is actually
_better_ managed by ZFS than by the driver, because ZFS can compare a
whole floating-point statistic across a whole vdev, while even a
driver fancier than we ever dreamed is still playing poker with only
one bit to show: ``I'll call'' or ``I'll fold.''  ZFS can see all the
cards and get better results while being stupider, requiring less
clever poker-guessing than a hypothetical driver B_FAILFAST
implementation that actually worked.
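And to put a little flesh on the driver-level UNC cache floated above,
an equally rough sketch, with invented names (unc_cache_insert(),
unc_cache_blocks()) and a fixed-size table standing in for whatever a
real driver would use, plus none of the locking it would need:

/*
 * Rough sketch of a driver-level UNC cache: remember the LBAs that
 * recently returned unrecovered-read errors and fail further reads of
 * them immediately, instead of letting the drive stall for its full
 * internal retry period again.  Purely illustrative.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define UNC_CACHE_SLOTS  64
#define UNC_HOLDOFF_SEC  (5 * 60)   /* somewhere in the 3-10 minute range */

struct unc_entry {
        uint64_t lba;
        time_t   when;
        bool     valid;
};

static struct unc_entry unc_cache[UNC_CACHE_SLOTS];

/* Record a sector that just came back as an unrecovered read. */
static void
unc_cache_insert(uint64_t lba)
{
        struct unc_entry *e = &unc_cache[lba % UNC_CACHE_SLOTS];

        e->lba = lba;
        e->when = time(NULL);
        e->valid = true;
}

/*
 * Should a read of this LBA be failed immediately?  True while the
 * hold-off window is open; after it expires, let the drive try again.
 */
static bool
unc_cache_blocks(uint64_t lba)
{
        struct unc_entry *e = &unc_cache[lba % UNC_CACHE_SLOTS];

        if (!e->valid || e->lba != lba)
                return (false);
        if (time(NULL) - e->when >= UNC_HOLDOFF_SEC) {
                e->valid = false;   /* window expired, retry allowed */
                return (false);
        }
        return (true);
}

int
main(void)
{
        unc_cache_insert(123456789ULL);   /* drive reported a UNC here */

        printf("LBA 123456789 read %s\n",
            unc_cache_blocks(123456789ULL) ? "failed fast from cache"
                                           : "passed through to drive");
        printf("LBA 42 read %s\n",
            unc_cache_blocks(42ULL) ? "failed fast from cache"
                                    : "passed through to drive");
        return (0);
}

The modulo-indexed table means collisions just overwrite older
entries, which is fine for a cache whose only job is to avoid
re-poking sectors the drive has just told us it can't read.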