>>>>> "ph" == Peter Hawkins <[EMAIL PROTECTED]> writes:
ph> Tried zpool replace. Unfortunately that takes me back into the
ph> cycle where as soon as the resilver starts the system hangs,
ph> not even CAPS Lock works. When I reset the system I have about
ph> a 10 second window to detach the device again to get the
ph> system back before it freezes.

I had problems like this with a firewire disk serving as a mirror component on b44.

1) if half of the mirror went away unexpectedly without 'offline'ing it first, zpool would later (after bringing the device back) show checksum errors accumulating on the bounced half of the mirror for weeks afterward.

2) if I tried to fix this with 'zpool scrub', it would announce that it expected the scrub of ~200GB to take 7 hours, and immediately the system ran at 1/10th speed or less for anything that touched the disk, like a web browser writing to its cache, though xterm was still fine. After about an hour the system stopped accessing the disk. It did not panic, and I think the mouse pointer still moved, but windows couldn't be moved or raised. I could never complete a scrub.

I think (2) might have had something to do with (1). My firewire case had a bad Prolific chipset, and every two days or so the case crashed and needed to be rebooted. This is documented on the web as happening with other operating systems, and does not happen under Solaris with my Oxford 911 case. This problem is why (1) kept happening to me. For (2), I bet the case was crashing during the scrub.

I replaced that firewire case with an iSCSI target (and continued using another firewire case with an Oxford 911 chip in it), and (1) and (2) both went away. Well, the system is still useless during a scrub, but by ``went away'' I mean it doesn't lock up---it eventually finishes.

So, I would suggest exercising each of your devices. Maybe one of the disks or cables is bad (and not necessarily the one you're replacing).
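If you want to watch for the kind of accumulating checksum errors I describe in (1), here's a minimal sketch. The pool name "tank" is a made-up example; the awk filter just picks out any vdev line in 'zpool status' output whose CKSUM column is nonzero.

```shell
#!/bin/sh
# Sketch: report vdevs with accumulating checksum errors.
# "tank" is a hypothetical pool name -- substitute your own.

cksum_errors() {
    # reads `zpool status` output on stdin; prints "device count"
    # for every line whose 5th field (CKSUM) is a nonzero number
    awk '$5 ~ /^[0-9]+$/ && $5 + 0 > 0 { print $1, $5 }'
}

# guard so the sketch is harmless on machines without zpool
if command -v zpool >/dev/null 2>&1; then
    zpool status tank | cksum_errors
fi
```

Run it periodically (cron, or a while/sleep loop) and a growing count on one half of a mirror is the symptom I was seeing.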
In my (highly anecdotal) experience, ZFS isn't robust to failures during a scrub, only to failures that happen outside one. The actual situation is probably more tangled than that.

I have two ways of ``exercising'' devices.

One:

  dd if=/dev/rdsk/cxtxdxs2 of=/dev/null bs=$(( 56 * 1024 ))    (SMI label)
  dd if=/dev/rdsk/cxtxdx of=/dev/null bs=$(( 56 * 1024 ))      (EFI label)

This tests the disks, controllers, and cables. It should be safe to do on a running system, and probably won't slow your system down as much as a ZFS scrub. Watch for I/O errors reported by 'dd' and for more detailed errors in dmesg.

Another:

  smartctl -t long /dev/rdsk/...
  smartctl -a /dev/rdsk/...
    (check that the test is running---there's an ``in progress'' row in
     the self-test log at the bottom---and how long it should take)
  smartctl -a /dev/rdsk/...
    (check that it does not say ``aborted by host command''. the test is
     supposed to run in the background, but with some old or dumb disks
     it doesn't)
  [wait several hours]
  smartctl -a /dev/rdsk/...
    (check that a new row has appeared in the self-test log at the
     bottom, and that it says Extended offline, Completed without error)

This tests whether the disk itself is good, regardless of the controller/driver/cable. The two tests together can help isolate a problem once you know you have one. The '-t long' also shouldn't hurt performance as much as a zpool scrub---the disks are supposed to be smart about it.

I'm not sure what to suggest for testing your last disk. You could try to 'zpool offline' the UNAVAIL disk and see if that stops ZFS from trying to open it, but this hasn't worked perfectly for me. You could test the disk in another machine, but that doesn't exercise the driver/controller/cable. If you can't work anything else out, you can boot your system with 'boot -m milestone=none' to prevent ZFS from coming up, and do your testing there. This is what I have to do to remove iSCSI targets for which ZFS is `patiently waiting' forever.
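If you have several disks to exercise, the dd test above is easy to wrap in a little script. This is just a sketch; the device paths you pass in are examples, and you'd substitute your own /dev/rdsk entries (run them one at a time if you want to notice which disk makes the system crawl).

```shell
#!/bin/sh
# Sketch: read each named device end to end with dd and report failures.
# Device paths are supplied as arguments, e.g.:
#   sh exercise.sh /dev/rdsk/c1t0d0s2 /dev/rdsk/c1t1d0

exercise() {
    # read one device (or any file) to the end; report pass/fail
    dev=$1
    if dd if="$dev" of=/dev/null bs=$(( 56 * 1024 )) 2>/dev/null; then
        echo "$dev: ok"
    else
        echo "$dev: READ ERROR -- check dmesg for details"
    fi
}

for dev in "$@"; do
    exercise "$dev"
done
```

An 'ok' here only means every block was readable; a disk can still be slow or flaky in ways only the SMART long test or dmesg will show.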
For problem (1) mentioned way at the top of my mail, I found I could avoid the checksum errors by 'zpool offline'ing iSCSI and firewire targets before I take them away. When I bring them back online in that case, the brief resilver DOES do enough to avoid checksum errors accumulating later.

I would say the (1) problem is probably in the Linux iSCSI target (IET) rather than ZFS, because I'm highly suspicious of Linux developers' ability to understand synchronize-cache or write barriers or anything of that sort---except that (1) happened with firewire too, so I think it is a real ZFS problem. I don't really find this acceptable, but at least it's repeatable, so I can do maintenance without suffering a day of unavailability for scrubbing.

However, I still have some problems, because ZFS seems to like to bring iSCSI targets online all by itself when they reappear. That's good sometimes, but it does this even if I've offlined them manually, which I found surprising because the documentation makes it sound like marking something offline is supposed to survive even reboots. And because of some crappiness with the Linux iSCSI target, I sometimes have to restart the target and remove/re-add the solaris initiator's ``discovery address'' to get the connection to come back up. The first time the connection comes up, ZFS onlines the target, and then when I later bounce it I run into problem (1) and get checksum errors later. So I still have to do scrubs, which for 1TB on my slow setup can take more than a day of too-slow-to-play-video. I have not tested bouncing iSCSI targets _during_ a scrub, though. I can try it if someone cares.
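The offline-before-bouncing routine I describe amounts to a three-command sequence. Here's a sketch of it; the pool name "tank" and device "c2t1d0" are hypothetical, and the DRYRUN switch (my own addition, not a zpool feature) just prints the commands instead of running them so you can see the order.

```shell
#!/bin/sh
# Sketch of the maintenance sequence that avoids problem (1):
# offline the vdev BEFORE bouncing the target, online it afterward.
# Pool/device names are hypothetical. DRYRUN=1 prints instead of runs.

run() {
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

maintain() {
    pool=$1; dev=$2
    run zpool offline "$pool" "$dev"
    # ... restart the iSCSI target / power-cycle the firewire case ...
    run zpool online "$pool" "$dev"
    run zpool status "$pool"
}
```

For example, `DRYRUN=1; maintain tank c2t1d0` prints the offline/online/status commands in order. The point is simply that the offline happens before the target goes away, so the later resilver is the short kind.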
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss