Karl Pielorz wrote:
> Hi All,
>
> I run ZFS (a version 6 pool) under FreeBSD. Whilst I realise this changes a 
> *whole heap* of things, I'm more interested in whether I did 'anything wrong' 
> when I had a recent drive failure...
>
> One of a mirrored pair of drives on the system started failing, badly 
> (confirmed by 'hard' read & write errors logged to the console). ZFS also 
> started showing errors, and the machine started hanging, waiting for I/Os to 
> complete (which is how I noticed it).
>
> How many errors does a drive have to throw before it's considered "failed" 
> by ZFS? Mine had got to about 30-40 [not a huge amount], but it was making 
> the system unusable, so I manually attached another hot-spare drive to the 
> 'good' device left in that mirrored pair.
>
> However, ZFS was still trying to read data off the failing drive - this 
> pushed the re-silver time up to 755 hours, whilst the error count climbed 
> to around 300 over the next forty minutes or so. Not wanting my data 
> unprotected for 755-odd hours (and fearing the number would just keep 
> climbing), I did:
>
>   zpool detach vol ad4
>
> ('ad4' was the failing drive).
>
> This hung all I/O on the pool :( - I waited 5 hours, and then decided to 
> reboot.
>   

This seems like a reasonable process to follow; I would have done
much the same.
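
Roughly, that attach-then-detach sequence would look something like the
following (the names here are placeholders: 'vol' is your pool, 'ad6' stands
in for the surviving mirror member and 'ad8' for the spare; substitute your
own devices):

  # attach the spare to the surviving half of the mirror; this starts a re-silver
  zpool attach vol ad6 ad8

  # watch the re-silver progress and the per-device error counts
  zpool status -v vol

  # once the spare has (mostly) re-silvered, drop the failing disk
  zpool detach vol ad4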

> After the reboot the pool came back OK (with 'ad4' removed) and the 
> re-silver continued, and completed in half an hour.
>   

There are failure modes that disks can get into which seem to be
solved by a power-on reset.  I had one of these just last week :-(.
We would normally expect a soft reset to clear the cobwebs, but
that was not my experience.

> Thinking about it - perhaps I should have detached ad4 (the failing drive) 
> before attaching another device? My thinking at the time was that I didn't 
> know how badly the drive had failed, and obviously removing what might have 
> been 200GB of 'perfectly' accessible data from a mirrored pair, prior to 
> re-silvering to a replacement, didn't sit right.
>
> I'm hoping ZFS shouldn't have hung when I later decided to fix the 
> situation and remove ad4?
>   

[caveat: I've not examined the FreeBSD ZFS port; the following
presumes it is similar to the Solaris port]
ZFS does not have its own timeouts for this sort of problem.
It relies on the underlying device drivers to manage their
timeouts.  So there was not much you could do at the ZFS level
other than detach the disk.
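
If the driver keeps retrying the dying disk, the only knobs you really have
are at the pool level; off the top of my head, something like:

  # take the flaky device out of service but keep it in the pool config
  # (-t makes the offline temporary, so it does not persist across a reboot)
  zpool offline -t vol ad4

  # or, once a replacement is re-silvering on the other side of the mirror,
  # remove the device from the vdev entirely
  zpool detach vol ad4

Either way the command still has to get past any I/O already stuck in the
driver, which may be why your detach appeared to hang.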
 -- richard
