On Sat, Apr 07, 2007 at 05:05:18PM -0500, in a galaxy far far away, Chris 
Csanady said:
> In a recent message, I detailed the excessive checksum errors that
> occurred after replacing a disk.  It seems that after a resilver
> completes, it leaves a large number of blocks in the pool which fail
> to checksum properly.  Afterward, it is necessary to scrub the pool in
> order to correct these errors.
> 
> After some testing, it seems that this only occurs with RAID-Z.  The
> same behavior can be observed on both snv_59 and snv_60, though I do
> not have any other installs to test at the moment.

A colleague at work and I followed the same steps today, including running a
digest on /test/file, on an SXCE build 61 system, and we can confirm the exact
same (and disturbing?) result.
My colleague mentioned that he has witnessed the same resilver behavior on
builds 57 and 60.

The box on which these steps were performed was upgraded from SXCE build 60 to
61 with luupgrade, using the SUNWlu* packages from build 61!

# cat /etc/release
                            Solaris Nevada snv_61 X86
           Copyright 2007 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 26 March 2007

# mkdir /tmp/test
# mkfile 64m /tmp/test/0 /tmp/test/1
# zpool create test raidz /tmp/test/0 /tmp/test/1
# mkfile 16m /test/file
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
# 
# zpool export test
# rm /tmp/test/0
# zpool import -d /tmp/test test
# mkfile 64m /tmp/test/0
# zpool replace test /tmp/test/0
# digest -v -a sha1 /test/file
sha1 (/test/file) = 3b4417fc421cee30a9ad0fd9319220a8dae32da2
# zpool status test
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Wed Apr 11 15:19:15 2007
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            /tmp/test/0  ONLINE       0     0     0
            /tmp/test/1  ONLINE       0     0     0

errors: No known data errors
# zpool scrub test
#
# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Wed Apr 11 15:22:30 2007
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            /tmp/test/0  ONLINE       0     0    17
            /tmp/test/1  ONLINE       0     0     0

errors: No known data errors
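To make the before/after comparison less error-prone (the resilver status shows 0
in every CKSUM column, while the scrub afterwards reports 17 on /tmp/test/0), a
small helper can total the CKSUM column. This is only a sketch: sum_cksum is a
hypothetical function, not part of any ZFS tooling, and it assumes the table
layout shown above and that the leaf vdevs are file paths beginning with "/", as
in this test. Real disk names such as c0t0d0 would need a different match.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints the sum
# of the CKSUM column for leaf vdevs. Assumes file-backed vdevs whose names
# start with "/" and the five-column NAME STATE READ WRITE CKSUM layout.
sum_cksum() {
  awk '$1 ~ /^\// && NF == 5 { total += $5 } END { print total + 0 }'
}
```

Usage would be something like `zpool status test | sum_cksum` once after the
resilver and again after the scrub; with the output above that gives 0 and then 17.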

I don't think these checksum errors are a good sign: the resilver reported 0
errors, yet the scrub that followed found 17 checksum errors on the replaced
device, /tmp/test/0. The sha1 digest of the file *is* the same before and after
the resilver, so the question arises: is the resilver process truly broken,
even though in this test case the file appears unchanged based on its sha1
digest?

Marco

-- 
# make mistake
make: don't know how to make mistake. Stop
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
