We have a number of shared spares configured in our ZFS pools, and
we're seeing weird issues where spares don't get used under some
circumstances. We're running Solaris 10 U6 using pools made up of
mirrored vdevs, and what I've seen is:
* if ZFS detects enough checksum errors on an active disk, it will
automatically pull in a spare.
* if the system reboots without some of the disks available (so that
half of the mirrored pairs drop out, for example), spares will *not*
get used. ZFS recognizes that the disks are not there; they are marked
as UNAVAIL and the vdevs (and pools) as DEGRADED, but it doesn't try to
use spares.
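(For reference, the way our spares are set up is the usual one; a
minimal sketch, with 'tank' and the cNtNdN device names being
hypothetical:

```shell
# create a pool of mirrored vdevs with a shared hot spare
zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 \
    spare c1t8d0

# or add a spare to an existing pool
zpool add tank spare c1t8d0

# the spare shows up under a 'spares' section, marked AVAIL
zpool status tank
```

The same spare device can be listed in several pools at once, which
is what makes it a shared spare.)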
(This is in a SAN environment where one side of each mirror comes
from one controller and the other side comes from a second controller.)
All of this makes me think that I don't understand how ZFS spares
really work, and under what circumstances they'll get used. Does
anyone know if there's a writeup of this somewhere?
(What I've gathered so far from reading zfs-discuss archives is that
ZFS spares are not handled automatically in the kernel code but are
instead deployed to pools by a fmd ZFS management module[*], doing more
or less 'zpool replace pool failing-dev spare' (presumably through
an internal code path, since 'zpool history' doesn't seem to show spare
deployment). Is this correct?)
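(If that's right, then what the fmd module does behind the scenes
should be equivalent to doing the replacement by hand, something
like the following; the pool and device names are again made up:

```shell
# manually attach the spare c1t8d0 in place of the failing c1t2d0;
# the spare then resilvers and shows up as 'INUSE'
zpool replace tank c1t2d0 c1t8d0

# once the original disk is fixed or replaced, detaching it returns
# the spare to the AVAIL state
zpool detach tank c1t2d0
```

which at least gives me a workaround for the reboot case, even if
it's not automatic.)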
Also, searching turns up some old zfs-discuss messages suggesting that
not bringing in spares in response to UNAVAIL disks was a bug that's now
fixed in at least OpenSolaris. If so, does anyone know if the fix has
made it into S10 U7 (or is planned or available as a patch)?
Thanks in advance.
- cks
[*: http://blogs.sun.com/eschrock/entry/zfs_hot_spares suggests that
it is 'zfs-retire', which is separate from 'zfs-diagnosis'.]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss