> About that issue, please check my post in:
> http://www.opensolaris.org/jive/thread.jspa?threadID=48483&tstart=0
Thanks - when I originally tried to replace the first drive, my intention was
to:
1. Move Solaris box and drives
2. Power up to test it still works
3. Power down
4. Replace drive.
I suspect I may have skipped steps 2 and 3, and ran into the same situation
that you did.
Anyhow, I seem to now be in an even bigger mess than earlier - when I tried to
simply swap out one of the old drives with a new one and perform a replace, I
ran into problems:
1. The hard drive light on the PC lit up, and I heard lots of disk noise, as
you would expect
2. The light went off. My continuous ping did the following:
Reply from 192.168.0.10: bytes=32 time<1ms TTL=255
Reply from 192.168.0.10: bytes=32 time<1ms TTL=255
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.0.10: bytes=32 time=2092ms TTL=255
Reply from 192.168.0.10: bytes=32 time<1ms TTL=255
Reply from 192.168.0.10: bytes=32 time<1ms TTL=255
3. The light came back on again .. more disk noise. Good - perhaps the pause
was just a momentary blip
4. Light goes off (this is about 20 minutes since the start)
5. zpool status reports that a resilver completed, and there are errors in
zp/storage and zp/VMware, and suggests that I should restore from backup
6. I nearly cry, as these are the only 2 files I use.
7. I have heard of ZFS thinking that there are unrecoverable errors before, so
I run zpool scrub and then zpool clear a number of times. They seem to make no
difference.
This whole project started when I wanted to move 900 GB of data from a Windows
Server 2003 box containing the 4 old disks to a Solaris box. I borrowed 2 x 500
GB drives from a friend, copied all the data onto them, put the 4 old drives
into the Solaris box, created the zpool, created my storage and VMware volumes,
shared them out using iSCSI, created NTFS volumes on the Server 2003 box and
copied the data back onto them. Aside from a couple of networking issues, this
worked absolutely perfectly. Then I decided I'd like some more space, and
that's where it all went wrong.
Despite the reports of corruption, the storage and VMware "drives" do still
work in Windows. The iSCSI initiator still picks them up, and if I run dir /a
/s, I can see all of the files that were on these NTFS volumes before I tried
this morning's replace. However, should I trust this? I suspect that even if I
ran a chkdsk /f, a successful result may not be all that it seems. I still have
the 2 x 500 GB drives with my data from weeks ago. I'd be sad to lose a few
weeks' worth of work, but that would be better than assuming that ZFS is
incorrect in saying the volumes are corrupt and then discovering in months'
time that I cannot get at NTFS files because of this root cause.
Since the report of corruption in these 2 volumes, I had a genius
troubleshooting idea - "what if the problem is not with ZFS, but instead with
Solaris not liking the drives in general?". I exported my current zpool,
disconnected all drives, plugged in the 4 new ones, and waited for the system
to boot again... nothing. The system had stopped in the BIOS, requesting that I
press F1 as SMART reports that one of the drives is bad! Already?!? I only
bought the drives a few days ago!!! Now the problem is that I know which of
these drives is bad, but I don't know whether this was the one that was plugged
in when zpool status reported all the read/write/checksum errors.
So maybe I have a duff batch of drives .. I leave the remaining 3 plugged in
and create a brand new zpool called test. No problems at all. I create a 1300
GB volume on it. Also no problem. I'm currently overwriting it with random data:
dd if=/dev/urandom of=/dev/zvol/rdsk/test/test bs=1048576 count=1331200
I throw in the odd zpool scrub to see how things are doing so far and as yet,
there hasn't been a single error of any sort. So, 3 of the WD drives (0430739,
0388708, 0417089) appear to be fine and one is dead already (0373211).
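
As a sanity check on the dd arguments above, the block size and count can be
multiplied out to confirm how much data gets written. This is only a sketch of
the arithmetic; it assumes the "1300 GB" volume was sized in binary units
(GiB), which the post does not actually state.

```shell
# bs and count are copied from the dd command above.
bs=1048576       # bytes per block (1 MiB)
count=1331200    # number of blocks dd is asked to write

total_bytes=$((bs * count))
total_gib=$((total_bytes / 1024 / 1024 / 1024))

# Assumption: the volume is 1300 GiB, so this should cover all of it.
echo "dd writes ${total_gib} GiB (${total_bytes} bytes)"
```

If the volume was really created in decimal gigabytes, dd would simply stop
early with an out-of-space error once the device is full.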
So this leads me to the conclusion that, ignoring the bad one, these drives
work fine with Solaris. They work fine with ZFS too. It's just the act of
trying to replace a drive from my old zpool with a new one that causes issues.
My next step will be to run the WD diagnostics on all drives, send the broken
one back, and then have 4 fully functioning 750 GB drives. I'll also import the
old zpool into the Solaris box - it'll undoubtedly complain that one of the
drives is missing (the one that I tried to add earlier and got all the errors),
so I think I'll try one more replace to get all 4 old drives back in the pool.
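
A rough sketch of that import-and-replace sequence is below. It is a dry run
by default and only prints the commands. The pool name zp comes from the
zp/storage and zp/VMware volumes mentioned earlier; the two device names are
made up for illustration and would need to be read from the real zpool status
output. Whether a replace is needed at all, or the pool simply picks the old
drive up again on import, is not something this sketch can tell you.

```shell
#!/bin/sh
# Dry run unless DRY_RUN=0 is set: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

POOL="zp"                 # pool name taken from the post
MISSING_DEV="c1t2d0"      # hypothetical: device zpool status shows as missing
RETURNING_DISK="c1t3d0"   # hypothetical: where the old drive is now attached

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

replace_plan() {
    run zpool import "$POOL"                                 # bring the old pool back
    run zpool status -v "$POOL"                              # see which device is missing
    run zpool replace "$POOL" "$MISSING_DEV" "$RETURNING_DISK"
    run zpool status -v "$POOL"                              # watch the resilver
    run zpool scrub "$POOL"                                  # re-verify once it finishes
}

replace_plan
```

Running it with DRY_RUN=0 would execute the commands for real, so the device
names should be checked against zpool status first.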
So, what do I do after that?
1. Create a brand new pool out of the WD drives, share it using iSCSI and copy
my data from my friend's drives onto it? I'll have lost a good few weeks of
work but I'll be confident that it isn't corrupt.
2. Igno