Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-24 Thread Chris Murray
Cheers, I did try that, but still got the same total on import - 2.73TB

I even thought I might have just made a mistake with the numbers, so I made a 
sort of 'quarter scale model' in VMware and OSOL 2009.06, with 3x250G and 
1x187G. That gave me a size of 744GB, which is *approx* 1/4 of what I get in 
the physical machine. That makes sense. I then replaced the 187 with another 
250, still 744GB total, as expected. Exported & imported - now 996GB. So, the 
export and import process seems to be the thing to do, but why it's not working 
on my physical machine (SXCE119) is a mystery. I even contemplated that there 
might have still been a 750GB drive left in the setup, but they're all 1TB 
(well, 931.51GB).
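
(As an aside, the same quarter-scale experiment can be reproduced without VMware by using file-backed vdevs - a rough sketch under that assumption, with made-up file and pool names:)

# mkfile 250m /var/tmp/d1 /var/tmp/d2 /var/tmp/d3
# mkfile 187m /var/tmp/d4
# zpool create model raidz /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
# zpool list model
# mkfile 250m /var/tmp/d5
# zpool replace model /var/tmp/d4 /var/tmp/d5
# zpool export model
# zpool import -d /var/tmp model
# zpool list model

The size should grow after the import, just as it did in the VMware model.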

Any ideas what else it could be?
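
(One possibility worth checking, assuming SXCE 119 already has them - I'm not sure exactly which build introduced them - is the pool's autoexpand property and the expand flag of zpool online; a hypothetical sketch, with a made-up device name:)

# zpool get autoexpand zp
# zpool set autoexpand=on zp
# zpool online -e zp c1t0d0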

For anyone interested in the checksum/permanent error thing, I'm running a 
scrub now. 59% done and not one error.


Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-24 Thread Chris Borrell
Try exporting and reimporting the pool. That has done the trick for me in the 
past
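
(For reference, that sequence is simply the following, using the pool name from the zpool list output further down the thread:)

# zpool export zp
# zpool import zp
# zpool list zp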


Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-22 Thread Chris Murray
I've had an interesting time with this over the past few days ...

After the resilver completed, I had the message "no known data errors" in a 
zpool status.

I guess the title of my post should have been "how permanent are permanent 
errors?". Now, I don't know whether the action of completing the resilver was 
the thing that fixed the one remaining error (in the snapshot of the 'meerkat' 
zvol), or whether my looped zpool clear commands have done it. Anyhow, for 
space/noise reasons, I set the machine back up with the original cables 
(eSATA), in its original tucked-away position, installed SXCE 119 to get me 
remotely up to date, and imported the pool.

So far so good. I then powered up a load of my virtual machines. None of them 
report errors when running a chkdsk, and SQL Server 'DBCC CHECKDB' hasn't 
reported any problems yet. Things are looking promising on the corruption front 
- feels like the errors that were reported while the resilvers were in progress 
have finally been fixed by the final (successful) resilver! Microsoft Exchange 
2003 did complain of corruption of mailbox stores; however, I have seen this a 
few times as a result of unclean shutdowns, and I don't think it's related to 
the errors that ZFS was reporting on the pool during the resilver.

Then, 'disk is gone' again - I think I can definitely put my original troubles 
down to cabling, which I'll sort out for good in the next few days. Now, I'm 
back on the same SATA cables which saw me through the resilvering operation.

One of the drives is showing read errors when I run dmesg. I'm having one 
problem after another with this pool!! I think the disk I/O during the resilver 
has tipped this disk over the edge. I'll replace it ASAP, and then I'll test 
the drive in a separate rig and RMA it.
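
(A rough sketch of the replacement step being described - the device names here are invented:)

# zpool status -v zp                 # confirm which device is accumulating the read errors
# zpool replace zp c1t2d0 c1t5d0     # old suspect disk -> new disk (hypothetical names)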

Anyhow, there is one last thing that I'm struggling with - getting the pool to 
expand to use the size of the new disk. Before my original replace, I had 3x1TB 
and 1x750GB disk. I replaced the 750 with another 1TB, which by my reckoning 
should give me around 4TB as a total size even after checksums and metadata. No:

# zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
rpool    74G  8.81G  65.2G  11%  ONLINE  -
zp     2.73T  2.36T   379G  86%  ONLINE  -

2.73T? I'm convinced I've expanded a pool in this way before. What am I missing?
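
(Since a raidz vdev can only grow to a multiple of its smallest member, one sanity check is that all four members really do report roughly 931GB - something along these lines, with invented device names:)

# zpool status zp                    # note the four members of the raidz vdev
# iostat -En c1t0d0 c1t1d0 c1t2d0 c1t3d0 | grep -i size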

Chris


Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-20 Thread Chris Murray
Ok, the resilver has been restarted a number of times over the past few days 
due to two main issues - a drive disconnecting itself, and power failure. I 
think my troubles are 100% down to these environmental factors, but I would 
like some confidence that, if the completed resilver reports there aren't any 
persistent errors, there actually aren't any.

Attempt #1: the resilver started after I initiated the replace on my SXCE105 
install. All was well until the box lost power. On starting back up, it hung 
while starting OpenSolaris - just after the line containing the system 
hostname. I've had this before when a scrub is in progress. My usual tactic is 
to boot with the 2009.06 live CD, import the pool, stop the scrub, export, 
reboot into SXCE105 again, and import. Of course, you can't stop a replace 
that's in progress, so the remaining attempts are in the 2009.06 live CD (build 
111b perhaps?)
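
(That tactic, in commands - the import from the live CD generally needs -f because the pool was last in use on the SXCE install:)

# zpool import -f zp
# zpool scrub -s zp                  # -s stops an in-progress scrub (but not a replace/resilver)
# zpool export zp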

Attempt #2: the resilver restarted when I imported the pool in 2009.06. It was 
resilvering fine until one drive reported itself as offline. dmesg showed that 
the drive was 'gone'. I then noticed a lot of checksum errors at both the pool 
level and the RAIDZ1 level, and a large number of 'permanent' errors. In a panic, 
thinking that the resilver was now doing more harm than good, I exported the 
pool and rebooted.

Attempt #3: I imported in 2009.06 again. This time, the drive that had 
disconnected during the last attempt was online again, and it proceeded to 
resilver along with the original drive. There was only one permanent error - in 
a particular snapshot of a ZVOL I'm not too concerned about. This is the point 
at which I wrote the original post, wondering whether all of those 700+ errors 
reported the first time around were no longer a problem. I have been running 
zpool clear in a loop because there were checksum errors on another of the 
drives (not either of the two in the replacing vdev, and not the one that was 
removed previously). I didn't want it to be marked as faulty, so I kept the 
zpool clear running. Then ... power failure.
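
(The loop is nothing clever - something along these lines, with an arbitrary interval:)

# while true; do zpool clear zp; sleep 60; done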

Attempt #4: I imported in 2009.06. This time, no errors were detected at all. 
Is that a result of my zpool clear? Would that have cleared any 'permanent' 
errors? From the wording, I'd say it wouldn't - in which case starting the 
resilver again with all of the correct disks in place simply hasn't found any 
errors so far ... ? Then, disk removal again ... :-(

Attempt #5: I'm convinced that the drive removals are down to faulty cabling. I 
moved the machine, completely disconnected all drives, re-wired all connections 
with new cables, and started the scrub again in 2009.06. Now, there are checksum 
errors again, so I'm running zpool clear in order to keep drives from being 
marked as faulted ... but I also have this:

errors: Permanent errors have been detected in the following files:
zp/iscsi/meerkat_t...@20090905_1631:<0x1>

I have a few of my usual VMs powered up (ESXi connecting using NFS), and they 
appear to be fine. I've run a chkdsk in the Windows VMs, and no errors are 
reported, although I can't be 100% confident that any of those files were in 
the original list of 700+ errors. In the absence of iscsitgtd, I'm not powering 
up the ones that rely on iSCSI just yet.
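
(Once back on a full install, re-enabling the target should just be a matter of its SMF service - assuming I have the FMRI right, which I'm not certain of:)

# svcs -x iscsitgt
# svcadm enable iscsitgt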

My next steps will be (a rough sketch of the commands follows the list):
1. allow the resilver to finish. Assuming I don't have yet another power cut, 
this will be in about 24 hours.
2. zpool export
3. reboot into SXCE
4. zpool import
5. start all my usual virtual machines on the ESXi host
6. note whether that permanent error is still there <-- this will be an 
interesting one for me - will the export & import clear the error? will my 
looped zpool clear have simply reset the checksum counters to zero, or will it 
have cleared this too?
7. zpool scrub to see what else turns up.
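
(Steps 2-4, 6 and 7 boil down to the following:)

# zpool export zp
  (reboot into SXCE 119)
# zpool import zp
# zpool status -v zp                 # is the meerkat snapshot error still listed?
# zpool scrub zp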

Chris


Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-19 Thread Victor Latushkin

On 17.09.09 21:44, Chris Murray wrote:

> Thanks David. Maybe I misunderstand how a replace works? When I added disk
> E, and used 'zpool replace [A] [E]' (still can't remember those drive names),
> I thought that disk A would still be part of the pool, and read from in order
> to build the contents of disk E?


Exactly. Disks A and E are arranged into a special vdev of type 'replacing' 
beneath the raidz vdev; that 'replacing' vdev behaves like a mirror. As soon as 
resilvering is complete, disk A is removed from this 'replacing' mirror, leaving 
disk E on its own in the raidz vdev.
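
(Schematically, while the replace is running the vdev tree looks like this, using the drive letters from the original post:)

  raidz1
    replacing          <- behaves like a two-way mirror
      disk A           (outgoing)
      disk E           (incoming, being resilvered)
    disk B
    disk C
    disk D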


victor


Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-19 Thread Victor Latushkin

On 17.09.09 13:29, Chris Murray wrote:

> I can flesh this out with detail if needed, but a brief chain of events is:


It would be nice to know what OS version/build/patch level you are running.


> 1. RAIDZ1 zpool with drives A, B, C & D (I don't have access to see
>    original drive names)
> 2. New disk E. Replaced A with E.
> 3. Part way through resilver, drive D was 'removed'
> 4. 700+ persistent errors detected, and lots of checksum errors on all
>    drives. Surprised by this - I thought the absence of one drive could be
>    tolerated?
> 5. Exported, rebooted, imported. Drive D present now. Good. :-)
> 6. Drive D disappeared again. Bad.  :-(
> 7. This time, only one persistent error.
>
> Does this mean that there aren't errors in the other 700+ files that it
> reported the first time, or have I lost my chance to note these down, and they
> are indeed still corrupt?


It depends on where that one persistent error is. If it is in some filesystem 
metadata, ZFS may no longer be able to reach the other errored blocks as a result...


So it's impossible to tell without a bit more detail.

victor


> I've re-run step 5 again, so it is now on the third attempted resilver.
> Hopefully drive D won't remove itself again, and I'll actually have 30+ hours
> of stability while the new drive resilvers ...
>
> Chris



Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-17 Thread Chris Murray
Thanks David. Maybe I misunderstand how a replace works? When I added disk E, 
and used 'zpool replace [A] [E]' (still can't remember those drive names), I 
thought that disk A would still be part of the pool, and read from in order to 
build the contents of disk E? Sort of like a safer way of doing the old 'swap 
one drive at a time' trick with RAID-5 arrays?

Chris


Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-17 Thread David Dyer-Bennet

On Thu, September 17, 2009 04:29, Chris Murray wrote:

> 2. New disk E. Replaced A with E.
> 3. Part way through resilver, drive D was 'removed'
> 4. 700+ persistent errors detected, and lots of checksum errors on all
> drives. Surprised by this - I thought the absence of one drive could be
> tolerated?

On a RAIDZ, the absence of one drive can be tolerated.  But note that you
said "part way through the resilver".  Drive E is not fully present, AND
drive D was removed -- you have 1+ drives missing, in a configuration that
can tolerate only 1 drive missing.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



[zfs-discuss] Persistent errors - do I believe?

2009-09-17 Thread Chris Murray
I can flesh this out with detail if needed, but a brief chain of events is:

1. RAIDZ1 zpool with drives A, B, C & D (I don't have access to see original 
drive names)
2. New disk E. Replaced A with E.
3. Part way through resilver, drive D was 'removed'
4. 700+ persistent errors detected, and lots of checksum errors on all drives. 
Surprised by this - I thought the absence of one drive could be tolerated?
5. Exported, rebooted, imported. Drive D present now. Good. :-)
6. Drive D disappeared again. Bad.  :-(
7. This time, only one persistent error.

Does this mean that there aren't errors in the other 700+ files that it 
reported the first time, or have I lost my chance to note these down, and they 
are indeed still corrupt?

I've re-run step 5 again, so it is now on the third attempted resilver. 
Hopefully drive D won't remove itself again, and I'll actually have 30+ hours 
of stability while the new drive resilvers ...

Chris