I just got a call from another of our admins, as I am the resident ZFS
expert. They have opened a support case with Oracle, but I figured I'd
ask here as well, as this forum often provides better, faster
answers :-)

    We have a server (M4000) with six FC-attached SE-3511 disk arrays
(some behind a 6920 DSP engine). There are many LUNs, all about 500 GB
and mirrored via ZFS. The LUNs from one tray of one of the
direct-attached 3511s went offline this morning. At that point 1/3 of
the LUNs from this one array were affected and UNAVAIL. We had
sufficient unused LUNs in the right places to substitute for the 9
failed LUNs, so I started a zpool replace to an unused LUN on another
array. At that point the other 2/3 of the LUNs from this 3511 went
offline, so one entire 3511 out of six was offline. No data loss, no
major issues, as everything is mirrored across arrays.
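
    For reference, the replace was along these lines (the pool and
device names below are placeholders for illustration, not the real
ones):

        # replace the UNAVAIL LUN with a spare LUN on a different array
        zpool replace tank c4t600A0B8000112233d0 c5t600A0B8000998877d0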

    Here is where the real problem starts. In order to get the failed
3511 back online, we did a cold start of it (this is actually a known
failure mode, one we had not seen in over a year, where one failing
drive takes out an entire tray in the 3511, and a large part of why
we are using ZFS to mirror across 3511 RAID arrays). This brought all
the LUNs from this 3511 back (although we have not yet done the zpool
clear or zpool online to bring all the vdevs back online).
Unfortunately, at some point in here the remaining good device in the
vdev that was resilvering tossed errors (probably transient FC errors
as the restarting 3511 logged back onto the fabric), which caused
ZFS to mark that device as UNAVAIL. We tested access to that LUN with
dd and we can read data from it. We tried to zpool online this device,
but the zpool command has not returned for over 5 minutes.
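
    For the record, the read test and the online attempt were roughly
the following (placeholder pool and device names again):

        # confirm the LUN is readable at the raw device level
        dd if=/dev/rdsk/c4t600A0B8000112233d0s0 of=/dev/null bs=1024k count=100

        # try to bring the device back into the pool
        zpool online tank c4t600A0B8000112233d0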

    Is there a way (other than zpool online) to kick ZFS into
rescanning the LUNs ?

---or---

    Are we going to have to export the zpool and then import it ?
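
    If it comes to that, I assume the sequence would just be the
following (placeholder pool name, and I realize the export may hang
the same way the online has):

        # release the pool, then re-probe the devices on import
        zpool export tank
        zpool import tank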

    I wanted to get opinions here while the folks on site are running
this past Oracle Support. I am NOT on site and can't pull config info
for the faulted zpool, but what I recall is that it is composed of 11
mirror vdevs, each about 500 GB. Only one of the vdevs is FAULTED (the
one that was resilvering), but two or three others have devices that
are UNAVAIL or FAULTED (but the vdev is degraded and not faulted).

    If I had realized the entire 3511 array had gone away and that we
would be restarting it, I would NOT have attempted to replace the
faulted LUN and we would probably be OK.

    Needless to say we really don't want to have to restore the data
from the backup (a ZFS send / recv replica at the other end of a 100
Mbps pipe), but we can if we cannot recover the data in the zpool.
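
    For context, the replica is kept current with routine incremental
send / recv over ssh, roughly like this (snapshot, pool, and host
names here are placeholders):

        # incremental replication of the latest snapshot to the remote box
        zfs send -i tank/data@prev tank/data@now | \
            ssh backuphost zfs recv -F backup/data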

    So what is the best option here ?

    And why isn't the zpool online returning ? The system is running
10U9 with (I think) the September 2010 CPU and a couple of
multipathing / SAS / SATA point patches (for an MPxIO and SATA bug we
found). The zpool version is 22 and the zfs version is either 4 or 5
(I forget which). We are moving off of the 3511s and onto a stack of
five J4400s with 750 GB SATA drives, but we aren't there yet :-(
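
    When someone is back at a console they can confirm the exact
versions with the get subcommands (pool and filesystem names here are
placeholders):

        # confirm on-disk versions
        zpool get version tank
        zfs get version tank/data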

P.S. The other zpools on the box are still up and running. The ones
that had devices on the faulted 3511 are degraded but online; the ones
that did not have devices on the faulted 3511 are OK. Because of these
other zpools we can't really reboot the box or pull the FC
connections.
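
A quick way for the folks on site to see just the affected pools:

        # show only pools that are not healthy
        zpool status -x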

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players