Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

Bill Sommerfeld Thu, 28 Aug 2008 14:48:10 -0700

On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote:
> A better option would be to not use this to perform FMA diagnosis, but
> instead work into the mirror child selection code.  This has already
> been alluded to before, but it would be cool to keep track of latency
> over time, and use this to both a) prefer one drive over another when
> selecting the child and b) proactively timeout/ignore results from one
> child and select the other if it's taking longer than some historical
> standard deviation.  This keeps away from diagnosing drives as faulty,
> but does allow ZFS to make better choices and maintain response times.
> It shouldn't be hard to keep track of the average and/or standard
> deviation and use it for selection; proactively timing out the slow I/Os
> is much trickier.


TCP has to solve essentially the same problem: decide when a response is
"overdue" based only on the timing of recent successful exchanges, in a
context where it's difficult to make assumptions about "reasonable"
expected behavior of the underlying network.

It tracks both the smoothed round trip time (SRTT) and its variation
(RTTVAR, a smoothed mean deviation rather than a true variance), and
declares a response overdue after (SRTT + K * RTTVAR).

I think you'd probably do well to start with something similar to what's
described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on
experience.
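
As a rough illustration of how that estimator might carry over to
per-device I/O latency, here is a minimal sketch in Python. The gains
(1/8 and 1/4), K = 4, and the first-sample initialization follow
RFC 2988; the class name, method names, and the 1 ms granularity are
hypothetical choices made for this example, not anything in ZFS. The
RFC's 1-second minimum on the timeout is deliberately left out, since it
would presumably be far too coarse for disk I/O.

```python
class LatencyEstimator:
    """RFC 2988-style timeout estimator, applied to I/O latency samples.

    Hypothetical helper for illustration; not part of ZFS.
    """

    ALPHA = 1 / 8  # gain for the smoothed latency (RFC 2988)
    BETA = 1 / 4   # gain for the deviation estimate (RFC 2988)
    K = 4          # deviation multiplier (RFC 2988)

    def __init__(self, granularity=0.001):
        self.srtt = None                # smoothed latency, in seconds
        self.rttvar = None              # smoothed mean deviation, in seconds
        self.granularity = granularity  # clock granularity G (assumed 1 ms)

    def observe(self, sample):
        """Fold one completed I/O's latency (in seconds) into the estimate."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            deviation = abs(self.srtt - sample)
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * deviation
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample

    def timeout(self):
        """Return the 'overdue' threshold, or None before any sample."""
        if self.srtt is None:
            return None
        return self.srtt + max(self.granularity, self.K * self.rttvar)


if __name__ == "__main__":
    est = LatencyEstimator()
    for latency in (0.010, 0.010, 0.012, 0.050):
        est.observe(latency)
        print(f"sample={latency:.3f}s  overdue after {est.timeout():.4f}s")
```

One estimator per mirror child would give both a basis for preferring
the faster child (compare the smoothed latencies) and a threshold for
treating an outstanding I/O as overdue.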

                                        - Bill





_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
