Karl Pielorz wrote:
>
>
> --On 08 September 2008 07:30 -0700 Richard Elling 
> <[EMAIL PROTECTED]> wrote:
>
>> This seems like a reasonable process to follow, I would have done
>> much the same.
>
>> [caveat: I've not examined the FreeBSD ZFS port; the following
>> presumes the FreeBSD port is similar to the Solaris port]
>> ZFS does not have its own timeouts for this sort of problem.
>> It relies on the underlying device drivers to manage their
>> timeouts.  So there was not much you could do at the ZFS level
>> other than detach the disk.
>
> Ok, I'm glad I'm finally getting the hang of ZFS, and 'did the right 
> thing(tm)'.
>
> Is there any tunable in ZFS that will tell it "if you get more than 
> x/y/z read, write, or checksum errors, detach the drive as 'failed'"? 
> Maybe on a per-drive basis?

This is the function of one or more diagnosis engines in Solaris.
Not all errors are visible to ZFS, so it makes sense to diagnose the error
where it is visible -- usually at the device driver level.

>
> It'd probably need some way for an admin to override it (i.e. force the 
> errors to be ignored) -- for those times when you either have to, or for 
> a drive you know will at least stand a chance of reading the rest of the 
> surface 'past' the errors.
>
> This would probably be set quite low for 'consumer' grade drives, and 
> moderately higher for 'enterprise' drives that don't "go out to lunch" 
> for extended periods while seeing if they can recover a block. You 
> could even default it to 'infinity' if that's what the current level is.
>
> It'd certainly have saved me a lot of time if the number of errors on 
> the drive had passed a relatively low figure, and it had just ditched 
> the drive...

In Solaris, this is implemented through the FMA diagnosis engines,
which communicate with interested parties, such as ZFS.  At present,
the variables really aren't tunable, per se, but you can see the values
in the source.  For example, the ZFS diagnosis engine is:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c
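
For a feel for the kind of policy such an engine applies, here is a
minimal sketch in C of an N-errors-within-T-seconds (SERD-style)
threshold.  This is purely illustrative -- it is not the zfs-diagnosis
code linked above, and the names and numbers (serd_record, SERD_N,
SERD_T) are invented for the example.

/*
 * Illustrative sketch only -- NOT the real zfs-diagnosis engine.
 * It mimics the general SERD-style policy such an engine applies:
 * declare a device faulty once SERD_N errors are seen within a
 * window of SERD_T seconds.  Names and thresholds are invented.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define SERD_N  10              /* hypothetical error-count threshold */
#define SERD_T  600             /* hypothetical time window, seconds  */

struct serd_engine {
        time_t  events[SERD_N]; /* timestamps of recent errors */
        size_t  count;          /* slots currently in use      */
};

/* Record one error event; return true if the threshold has fired. */
static bool
serd_record(struct serd_engine *se, time_t now)
{
        size_t keep = 0;

        /* Drop events that have aged out of the SERD_T window. */
        for (size_t i = 0; i < se->count; i++) {
                if (now - se->events[i] <= SERD_T)
                        se->events[keep++] = se->events[i];
        }
        se->count = keep;

        if (se->count == SERD_N)
                return true;    /* already at the limit inside the window */

        se->events[se->count++] = now;
        return (se->count == SERD_N);
}

int
main(void)
{
        struct serd_engine se = { .count = 0 };
        time_t now = time(NULL);

        /* Simulate read errors arriving one second apart. */
        for (int i = 0; i < 15; i++) {
                if (serd_record(&se, now + i)) {
                        printf("threshold tripped after error %d -- "
                            "fault the device\n", i + 1);
                        break;
                }
        }
        return 0;
}

On a live Solaris box you can watch the real engines with fmstat(1M) and
see the resulting faults with fmadm(1M); the point of the sketch is just
that the "more than x/y/z errors, then detach" decision lives in this
diagnosis layer rather than in a per-pool or per-drive property.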

>
> One other random thought occurred to me when this happened - if I 
> detach a drive, does ZFS have to update some meta-data on *all* the 
> drives for that pool (including the one I've detached) to know it's 
> been detached? (if that makes sense).

Yes.

>
> That might explain why the 'detach' I issued just hung (if it had to 
> update meta-data on the drive I was removing, it probably got caught 
> in the wash of failing I/O timing out on that device).

Yes, I believe this is consistent with what you saw.
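
To picture why the detach appeared to hang: when the pool configuration
changes, the updated config has to be written to the label area on every
device in the pool, including the disk being removed.  The sketch below
is a rough conceptual illustration of that pool-wide update, not ZFS's
actual vdev label code; the types and names (fake_vdev, write_label,
update_pool_config) are invented.

/*
 * Conceptual sketch only -- NOT ZFS's vdev label code.
 * The point: a config change is pushed to every device's label,
 * so one device stuck in long driver timeouts can stall the
 * whole operation.
 */
#include <stddef.h>
#include <stdio.h>

struct fake_vdev {
        const char *path;
        int (*write_label)(struct fake_vdev *vd, const void *config,
            size_t len);        /* may block a long time on a sick disk */
};

static int
update_pool_config(struct fake_vdev **vdevs, size_t nvdevs,
    const void *new_config, size_t len)
{
        /*
         * Every device gets the new configuration, including the one
         * being detached -- if that disk is timing out down in the
         * driver, this loop is where the command appears to hang.
         */
        for (size_t i = 0; i < nvdevs; i++) {
                int err = vdevs[i]->write_label(vdevs[i], new_config, len);
                if (err != 0)
                        return (err);
        }
        return (0);
}

/* Stub that pretends to write a label instantly. */
static int
fake_write_label(struct fake_vdev *vd, const void *config, size_t len)
{
        (void) config;
        (void) len;
        printf("updating label on %s\n", vd->path);
        return (0);
}

int
main(void)
{
        struct fake_vdev good = { "c1t0d0", fake_write_label };
        struct fake_vdev sick = { "c1t1d0", fake_write_label };
        struct fake_vdev *pool[] = { &good, &sick };

        return (update_pool_config(pool, 2, "new-config",
            sizeof ("new-config")));
}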
 -- richard

