On Wed, 23 Dec 2009 03:02:47 +0100, Mike Gerdts <mger...@gmail.com> wrote:

> I've been playing around with zones on NFS a bit and have run into
> what looks to be a pretty bad snag - ZFS keeps seeing read and/or
> checksum errors.  This exists with S10u8 and OpenSolaris dev build
> snv_129.  This is likely a blocker for anyone thinking of
> implementing parts of Ed's Zones on Shared Storage:
>
> http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss
>
> The OpenSolaris example appears below.  The order of events is:
>
> 1) Create a file on NFS, turn it into a zpool
> 2) Configure a zone with the pool as zonepath
> 3) Install the zone, verify that the pool is healthy
> 4) Boot the zone, observe that the pool is sick
[...]
> r...@soltrain19# zoneadm -z osol boot
>
> r...@soltrain19# zpool status osol
>   pool: osol
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME                  STATE     READ WRITE CKSUM
>         osol                  DEGRADED     0     0     0
>           /mnt/osolzone/root  DEGRADED     0     0   117  too many errors
>
> errors: No known data errors

Hey Mike, you're not the only victim of these strange CKSUM errors. I hit the
same thing during my slightly different testing, where I NFS-mount an entire,
pre-existing remote file that lives in a zpool on the NFS server, and use that
file to create a zpool and install zones into it.

Today I filed:

6915265 zpools on files (over NFS) accumulate CKSUM errors with no apparent reason

Here's the relevant piece worth investigating (leaving out the actual setup etc.).
As in your case, creating the zpool and installing the zone into it still gives a
healthy zpool, but immediately after booting the zone, the zpool served over NFS
accumulates CKSUM errors.
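
For reference, the whole procedure boils down to roughly the following (a sketch
only; the share, file, pool and zone names here are placeholders, not my exact
commands):

mount -F nfs nfsserver:/export/zonefiles /mnt        # mount the share holding the pre-existing backing file
zpool create nfszone /mnt/nfszone                    # create the file-backed pool over NFS
zonecfg -z nfszone 'create; set zonepath=/nfszone/zone'
zoneadm -z nfszone install
zpool status nfszone                                 # pool is still healthy at this point
zoneadm -z nfszone boot
zpool status nfszone                                 # CKSUM errors start accumulating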

Of particular interest are the 'cksum_actual' values as reported by Mike for his
test case here:

http://www.mail-archive.com/zfs-disc...@opensolaris.org/msg33041.html

compared to the 'cksum_actual' values I got in the fmdump error output on my own
test case/system.

Note that the NFS server's zpool, which hosts and shares the file we use, is
itself healthy.

With the zone now halted on my test system, checking fmdump shows:

osoldev.batschul./export/home/batschul.=> fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail
   2    cksum_actual = 0x4bea1a77300 0xf6decb1097980 0x217874c80a8d9100 0x7cd81ca72df5ccc0
   2    cksum_actual = 0x5c1c805253 0x26fa7270d8d2 0xda52e2079fd74 0x3d2827dd7ee4f21
   6    cksum_actual = 0x28e08467900 0x479d57f76fc80 0x53bca4db5209300 0x983ddbb8c4590e40
*A   6    cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
*B   7    cksum_actual = 0x0 0x0 0x0 0x0
*C  11    cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
*D  14    cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
*E  17    cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
*F  20    cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
*G  25    cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0

osoldev.root./export/home/batschul.=> zpool status -v
  pool: nfszone
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        nfszone     DEGRADED     0     0     0
          /nfszone  DEGRADED     0     0   462  too many errors

errors: No known data errors

==========================================================================

Now compare this with Mike's error output as posted here:

http://www.mail-archive.com/zfs-disc...@opensolaris.org/msg33041.html

# fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail

   2    cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 0x290cbce13fc59dce
*D   3    cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00
*E   3    cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00
*B   4    cksum_actual = 0x0 0x0 0x0 0x0
   4    cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 0x330107da7c4bcec0
   5    cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 0x4e0b3a8747b8a8
*C   6    cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40
*A   6    cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0
*F  16    cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
*G  48    cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0

Observe that the 'cksum_actual' values causing our pools' CKSUM errors, because
they mismatch what was expected, are the SAME across two totally different client
systems and two different NFS servers (mine vs. Mike's); see the entries marked
*A through *G.

This just can't be an accident; such an overlap is very unlikely to be mere
coincidence, so there's a good chance these CKSUM errors have a common source,
either in ZFS or in NFS.
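
For anyone who wants to verify the overlap mechanically, saving both fmdump
listings and intersecting them shows the shared values at a glance (a quick
sketch; the file names are placeholders, and one listing obviously has to be
copied to the other machine first):

fmdump -eV | grep cksum_actual | sort -u > /tmp/cksums.mine    # run on my client
fmdump -eV | grep cksum_actual | sort -u > /tmp/cksums.mike    # run on Mike's client, then copy over
comm -12 /tmp/cksums.mine /tmp/cksums.mike                     # print only the values seen on both systems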

cheers
frankB
