>>>>> "es" == Eric Schrock <[EMAIL PROTECTED]> writes:

    es> Finally, imposing additional timeouts in ZFS is a bad idea.
    es> [...] As such, it doesn't have the necessary context to know
    es> what constitutes a reasonable timeout.

you're right in terms of fixed timeouts, but there's no reason ZFS
can't compare the performance of redundant data sources: if one vdev
performs an order of magnitude slower than another set of vdevs with
sufficient redundancy, stop issuing reads to the underperformer except
for scrubs/healing (keep issuing writes), and pass an event to FMA.

ZFS can also compare the performance of a drive to itself over time,
and if the performance suddenly decreases, do the same.
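
To make the first idea concrete, here's a rough sketch in C.  The
names and thresholds are mine, not anything that exists in ZFS today:
keep a decaying read-latency average per child of a redundant vdev,
and refuse ordinary reads to a child that's an order of magnitude
slower than its best healthy sibling.

    /*
     * Hypothetical sketch: per-child decaying read-latency tracking
     * for a redundant vdev.  Names and thresholds are illustrative,
     * not existing ZFS code.
     */
    #include <stddef.h>

    #define SLOW_FACTOR  10.0   /* "order of magnitude slower" */
    #define EWMA_WEIGHT  0.125  /* weight given to each new sample */

    typedef struct child_stats {
            double  ewma_latency_us;  /* decaying average read latency */
            int     healthy;          /* nonzero if child is usable */
    } child_stats_t;

    /* Fold a completed read's latency into the child's running average. */
    static void
    child_record_read(child_stats_t *cs, double latency_us)
    {
            if (cs->ewma_latency_us == 0.0)
                    cs->ewma_latency_us = latency_us;
            else
                    cs->ewma_latency_us +=
                        EWMA_WEIGHT * (latency_us - cs->ewma_latency_us);
    }

    /*
     * Decide whether ordinary (non-scrub) reads should still go to
     * child 'c': compare it against the fastest healthy sibling and
     * refuse reads if it's SLOW_FACTOR times slower.  Writes and
     * scrub/resilver reads bypass this check so the slow child stays
     * in sync.
     */
    static int
    child_read_eligible(const child_stats_t *children, size_t n, size_t c)
    {
            double  best = 0.0;
            size_t  i;

            for (i = 0; i < n; i++) {
                    if (!children[i].healthy ||
                        children[i].ewma_latency_us == 0.0)
                            continue;
                    if (best == 0.0 || children[i].ewma_latency_us < best)
                            best = children[i].ewma_latency_us;
            }
            if (best == 0.0)
                    return (1);     /* no data yet; refuse nothing */
            return (children[c].ewma_latency_us <= SLOW_FACTOR * best);
    }

A child that keeps flunking child_read_eligible() is exactly the thing
to hand to FMA as an event.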

The former case eliminates the need for the mirror read policies in
SVM, which Ian requested a few hours ago for the case where half the
mirror is a slow iSCSI target kept for geographic redundancy and the
other half is faster and local.  Some care would have to be taken for
targets shared between ZFS and some other initiator, but I'm not sure
that care would really be so difficult to take, or that the
oscillations induced by failing to take it would be much more harmful
than unsupervised contention for the device already is.

The latter quickly notices drives that have been pulled, or, for
Richard's ``overwhelmingly dominant'' case, drives that stall for 30
seconds before reporting an unrecovered read.

Developing meaningful performance statistics for drives, and a tool
for displaying them, would be useful in itself, not just for stopping
freezes and preventing a failing drive from degrading performance a
thousandfold.

Issuing reads to redundant devices is cheap compared to freezing.  The
policy for doing so is highly tunable and should be fun to tune and
watch, and the consequences if the policy makes the wrong choice
aren't especially dire.


This B_FAILFAST architecture captures the situation really poorly.

First, it's not implementable in any serious way with near-line
drives, or really with any drives whose firmware and release
engineering you aren't intimately familiar with and in control of, and
perhaps not with any drives, period.  I suspect that in practice it's
more of a controller-level feature: whether or not you'd like to
distrust the device's error report and start resetting buses and
channels and mucking everything up trying to recover from some kind of
``weirdness''.  It's not an answer to the known problem of drives
stalling for 30 seconds when they start to fail.

First and a half, when it's not implemented, the system degrades to
doubling your timeout pointlessly.  A driver-level block cache of UNCs
would probably do more for this speed/read-aggressiveness tradeoff
than the whole B_FAILFAST architecture---just cache known
unrecoverable-read sectors and refuse to issue further I/O for them
until a timeout of 3-10 minutes passes.  I bet this would speed up
most failures tremendously, and without burdening upper layers with
retry logic.
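
Here's what I mean by that cache, as a sketch with invented names and
sizes---nothing I know of in the driver today: remember recently
reported bad LBAs for a few minutes and fail reads for them
immediately, instead of letting the drive grind on them again.

    /*
     * Hypothetical sketch of a driver-level cache of known-UNC
     * sectors.  Slot count and TTL are illustrative only.
     */
    #include <stdint.h>
    #include <time.h>

    #define UNC_CACHE_SLOTS 64
    #define UNC_CACHE_TTL   (5 * 60)        /* seconds; "3-10 minutes" */

    typedef struct unc_entry {
            uint64_t        lba;
            time_t          expires;
    } unc_entry_t;

    static unc_entry_t unc_cache[UNC_CACHE_SLOTS];

    /* Record a sector that just returned an unrecovered read error. */
    void
    unc_cache_insert(uint64_t lba)
    {
            time_t  now = time(NULL);
            int     i, victim = 0;

            for (i = 0; i < UNC_CACHE_SLOTS; i++) {
                    if (unc_cache[i].expires <= now) {
                            victim = i;     /* free or expired slot */
                            break;
                    }
                    if (unc_cache[i].expires < unc_cache[victim].expires)
                            victim = i;     /* else evict the oldest */
            }
            unc_cache[victim].lba = lba;
            unc_cache[victim].expires = now + UNC_CACHE_TTL;
    }

    /*
     * Before queueing a read, ask whether the LBA is known-bad; if so
     * the caller fails the request immediately, without touching the
     * drive.
     */
    int
    unc_cache_contains(uint64_t lba)
    {
            time_t  now = time(NULL);
            int     i;

            for (i = 0; i < UNC_CACHE_SLOTS; i++) {
                    if (unc_cache[i].lba == lba &&
                        unc_cache[i].expires > now)
                            return (1);
            }
            return (0);
    }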

Second, B_FAILFAST entertains the fantasy that I/Os are independent,
while what happens in practice is that the drive hits a UNC on one
I/O and then won't entertain any further I/Os, no matter what flags
the requests carry or how many times you ``reset'' things.


Maybe you could try to rescue B_FAILFAST by putting clever statistics
into the driver to compare the drive's performance to its recent past,
as I suggested ZFS do, and admit no B_FAILFAST requests to the queues
of drives that have suddenly slowed down---just fail them immediately
without even trying.  I submit that this queueing and statistics
collection is actually _better_ managed by ZFS than by the driver,
because ZFS can compare a whole floating-point statistic across a
whole vdev, while even a driver fancier than we ever dreamed is still
playing poker with only one bit of input: ``I'll call'' or ``I'll
fold.''  ZFS can see all the cards and get better results while being
stupider and requiring less clever poker-guessing than a hypothetical
driver B_FAILFAST implementation that actually worked.
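
For completeness, here's a sketch of the ``compare the drive to its
own recent past'' statistic, again with invented names and thresholds:
a short-term and a long-term decaying average, plus a suddenly-slow
test that a driver could use to reject B_FAILFAST requests outright,
or that ZFS could use to post the FMA event.

    /*
     * Hypothetical sketch: detect a sudden slowdown by comparing a
     * short-term decaying latency average against a long-term one.
     * Weights and the ratio are illustrative only.
     */
    #define SHORT_WEIGHT    0.25
    #define LONG_WEIGHT     0.01
    #define SLOWDOWN_RATIO  10.0

    typedef struct drive_latency {
            double  short_ewma_us;
            double  long_ewma_us;
    } drive_latency_t;

    /* Fold one completed I/O's latency into both averages. */
    static void
    drive_record_io(drive_latency_t *dl, double latency_us)
    {
            if (dl->long_ewma_us == 0.0) {
                    dl->short_ewma_us = latency_us;
                    dl->long_ewma_us = latency_us;
                    return;
            }
            dl->short_ewma_us +=
                SHORT_WEIGHT * (latency_us - dl->short_ewma_us);
            dl->long_ewma_us +=
                LONG_WEIGHT * (latency_us - dl->long_ewma_us);
    }

    /* True when the drive is an order of magnitude off its own norm. */
    static int
    drive_suddenly_slow(const drive_latency_t *dl)
    {
            return (dl->long_ewma_us > 0.0 &&
                dl->short_ewma_us > SLOWDOWN_RATIO * dl->long_ewma_us);
    }

ZFS gets to see this whole floating-point picture across every child
of the vdev; the driver only ever sees one request's flag.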
