>>>>> "ph" == Peter Hawkins <[EMAIL PROTECTED]> writes:
ph> Tried zpool replace. Unfortunately that takes me back into the
ph> cycle where as soon as the resilver starts the system hangs,
ph> not even CAPS Lock works. When I reset the system I have about
ph> a 10 second window to detach the device again to get the
ph> system back before it freezes.

I had problems like this with a firewire disk serving as a mirror component on b44.

1) if half of the mirror went away unexpectedly without 'offline'ing it first, zpool would later (after bringing the device back) show checksum errors accumulating on the bounced half of the mirror for weeks afterward.

2) if I tried to fix this with 'zpool scrub', it would announce that it expected the scrub of ~200GB to take 7 hours, and immediately the system ran at 1/10th speed or less for anything that touched the disk, like a web browser writing to its cache, though xterm was still fine. After about an hour the system stopped accessing the disk. It did not panic, and I think the mouse pointer still moved, but windows couldn't be moved or raised. I could never complete a scrub.

I think (2) might have had something to do with (1). My firewire case had a bad Prolific chipset, and every two days or so the case crashed and needed to be rebooted. This is documented on the web as happening with other operating systems, and does not happen under Solaris with my Oxford 911 case. This problem is why (1) kept happening to me. For (2), I bet the case was crashing during the scrub.

I replaced that firewire case with an iSCSI target (and continued using another firewire case with an Oxford 911 chip in it), and (1) and (2) both went away. Well, the system is still useless during a scrub, but by ``went away'' I mean it doesn't lock up---it eventually finishes.

So, I would suggest exercising each of your devices. Maybe one of the disks or cables is bad (and not necessarily the one you're replacing).
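If you want to watch for the kind of accumulating checksum errors I describe in (1), here's a minimal sketch. The pool name "tank" is a made-up example; the awk filter just picks out any vdev line in 'zpool status' output whose CKSUM column is nonzero.

```shell
#!/bin/sh
# Sketch: report vdevs with accumulating checksum errors.
# "tank" is a hypothetical pool name -- substitute your own.

cksum_errors() {
    # reads `zpool status` output on stdin; prints "device count"
    # for every line whose 5th field (CKSUM) is a nonzero number
    awk '$5 ~ /^[0-9]+$/ && $5 + 0 > 0 { print $1, $5 }'
}

# guard so the sketch is harmless on machines without zpool
if command -v zpool >/dev/null 2>&1; then
    zpool status tank | cksum_errors
fi
```

Run it periodically (cron, or a while/sleep loop) and a growing count on one half of a mirror is the symptom I was seeing.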
In my (highly anecdotal) experience, ZFS isn't robust to failures during a scrub, only to failures that happen outside one. The actual situation is probably more tangled than that.

I have two ways of ``exercising'' devices.

One:

  dd if=/dev/rdsk/cxtxdxs2 of=/dev/null bs=$(( 56 * 1024 ))    (SMI label)
  dd if=/dev/rdsk/cxtxdx of=/dev/null bs=$(( 56 * 1024 ))      (EFI label)

This tests the disks, controllers, and cables. It should be safe to do on a running system, and probably won't slow your system down as much as a ZFS scrub. Watch for I/O errors reported by 'dd' and for more detailed errors in dmesg.

Another:

  smartctl -t long /dev/rdsk/...
  smartctl -a /dev/rdsk/...
    (check that the test is running---there's an ``in progress'' row in
     the self-test log at the bottom---and how long it should take)
  smartctl -a /dev/rdsk/...
    (check that it does not say ``aborted by host command''. the test is
     supposed to run in the background, but with some old or dumb disks
     it doesn't)
  [wait several hours]
  smartctl -a /dev/rdsk/...
    (check that a new row has appeared in the self-test log at the
     bottom, and that it says Extended offline, Completed without error)

This tests whether the disk itself is good, regardless of the controller/driver/cable. The two tests together can help isolate a problem once you know you have one. The '-t long' also shouldn't hurt performance as much as a zpool scrub---the disks are supposed to be smart about it.

I'm not sure what to suggest for testing your last disk. You could try to 'zpool offline' the UNAVAIL disk and see if that stops ZFS from trying to open it, but this hasn't worked perfectly for me. You could test the disk in another machine, but that doesn't exercise the driver/controller/cable. If you can't work anything else out, you can boot your system with 'boot -m milestone=none' to prevent ZFS from coming up, and do your testing there. This is what I have to do to remove iSCSI targets for which ZFS is `patiently waiting' forever.
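If you have several disks to exercise, the dd test above is easy to wrap in a little script. This is just a sketch; the device paths you pass in are examples, and you'd substitute your own /dev/rdsk entries (run them one at a time if you want to notice which disk makes the system crawl).

```shell
#!/bin/sh
# Sketch: read each named device end to end with dd and report failures.
# Device paths are supplied as arguments, e.g.:
#   sh exercise.sh /dev/rdsk/c1t0d0s2 /dev/rdsk/c1t1d0

exercise() {
    # read one device (or any file) to the end; report pass/fail
    dev=$1
    if dd if="$dev" of=/dev/null bs=$(( 56 * 1024 )) 2>/dev/null; then
        echo "$dev: ok"
    else
        echo "$dev: READ ERROR -- check dmesg for details"
    fi
}

for dev in "$@"; do
    exercise "$dev"
done
```

An 'ok' here only means every block was readable; a disk can still be slow or flaky in ways only the SMART long test or dmesg will show.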
For problem (1) mentioned way at the top of my mail, I found I could avoid the checksum errors by 'zpool offline'ing iSCSI and firewire targets before I take them away. When I bring them back online in that case, the brief resilver DOES do enough to avoid checksum errors accumulating later.

I would say the (1) problem is probably in the Linux iSCSI target (IET) rather than ZFS, because I'm highly suspicious of Linux developers' ability to understand synchronize-cache or write barriers or anything of that sort---except that (1) happened with firewire too, so I think it is a real ZFS problem. I don't really find this acceptable, but at least it's repeatable, so I can do maintenance without suffering a day of unavailability for scrubbing.

However, I still have some problems, because ZFS seems to like to bring iSCSI targets online all by itself when they reappear. That's good sometimes, but it does this even if I've offlined them manually, which I found surprising because the documentation makes it sound like marking something offline is supposed to survive even reboots. And because of some crappiness with the Linux iSCSI target, I sometimes have to restart the target and remove/re-add the solaris initiator's ``discovery address'' to get the connection to come back up. The first time the connection comes up, ZFS onlines the target, and then when I later bounce it I run into problem (1) and get checksum errors later. So I still have to do scrubs, which for 1TB on my slow setup can take more than a day of too-slow-to-play-video. I have not tested bouncing iSCSI targets _during_ a scrub, though. I can try it if someone cares.
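The offline-before-bouncing routine I describe amounts to a three-command sequence. Here's a sketch of it; the pool name "tank" and device "c2t1d0" are hypothetical, and the DRYRUN switch (my own addition, not a zpool feature) just prints the commands instead of running them so you can see the order.

```shell
#!/bin/sh
# Sketch of the maintenance sequence that avoids problem (1):
# offline the vdev BEFORE bouncing the target, online it afterward.
# Pool/device names are hypothetical. DRYRUN=1 prints instead of runs.

run() {
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

maintain() {
    pool=$1; dev=$2
    run zpool offline "$pool" "$dev"
    # ... restart the iSCSI target / power-cycle the firewire case ...
    run zpool online "$pool" "$dev"
    run zpool status "$pool"
}
```

For example, `DRYRUN=1; maintain tank c2t1d0` prints the offline/online/status commands in order. The point is simply that the offline happens before the target goes away, so the later resilver is the short kind.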
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss