Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-23 Thread Jürgen Keil
I wrote:
> Bill Sommerfeld wrote:
>> On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
>>>> I ran a scrub on a root pool after upgrading to snv_94, and got
>>>> checksum errors:
>>>
>>> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
>>> on a system that is running post snv_94 bits:  It also found checksum
>>> errors
>>
>> once is accident.  twice is coincidence.  three times is enemy action :-)
>>
>> I'll file a bug as soon as I can
>
> I filed 6727872, for the problem with zpool scrub checksum errors
> on unmounted zfs filesystems with an unplayed ZIL.

6727872 has already been fixed, in what will become snv_96.

For my zpool, zpool scrub doesn't report checksum errors any more.

But something is still a bit strange about the data reported by zpool status:
the error counts it displays are all 0, both during the scrub and after it has
completed, yet when the scrub finishes it reports "scrub completed after 0h58m
with 6 errors" without listing any of those errors.

# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub in progress for 0h57m, 99.39% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        files         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s6  ONLINE       0     0     0
            c9t0d0s6  ONLINE       0     0     0

errors: No known data errors


# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h58m with 6 errors on Wed Jul 23 18:23:00 2008
config:

        NAME          STATE     READ WRITE CKSUM
        files         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c8t0d0s6  ONLINE       0     0     0
            c9t0d0s6  ONLINE       0     0     0

errors: No known data errors
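
One way to cross-check those two numbers (untested on my side, so treat it as
a guess) would be the FMA error log, which should contain one ereport for each
checksum error the scrub encountered:

# fmdump -e | tail                               # recent error-log entries
# fmdump -eV | grep -c ereport.fs.zfs.checksum   # count the checksum ereports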


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-21 Thread Jürgen Keil
Bill Sommerfeld wrote:

> On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
>>> I ran a scrub on a root pool after upgrading to snv_94, and got checksum
>>> errors:
>>
>> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
>> on a system that is running post snv_94 bits:  It also found checksum errors
>
> out of curiosity, is this a root pool?

It started as a standard pool, and is using the version 3 zpool format.

I'm using a small ufs root, and have /usr as a zfs filesystem on
that pool.

At some point in the past I set up a zfs root and /usr filesystem
for experimenting with xVM unstable bits.


> A second system of mine with a mirrored root pool (and an additional
> large multi-raidz pool) shows the same symptoms on the mirrored root
> pool only.
>
> once is accident.  twice is coincidence.  three times is enemy action :-)
>
> I'll file a bug as soon as I can (I'm travelling at the moment with
> spotty connectivity), citing my and your reports.

Btw., I also found the scrub checksum errors on a non-mirrored zpool
(a laptop with only one hdd).

And on a zpool that was striped, without mirroring, across two
S-ATA drives.


I think that in my case the cause of the scrub checksum errors is an
open ZIL transaction on an *unmounted* zfs filesystem.  In the past
such a state prevented creating snapshots of the unmounted zfs (see
bugs 6482985 and 6462803); that is still the case.  But now it also
seems to trigger checksum errors during a zpool scrub.

Stack backtrace for the ECKSUM (which gets translated into EIO errors
in arc_read_done()):

  1  64703   arc_read_nolock:return, rval 5
  zfs`zil_read_log_block+0x140
  zfs`zil_parse+0x155
  zfs`traverse_zil+0x55
  zfs`scrub_visitbp+0x284
  zfs`scrub_visit_rootbp+0x4e
  zfs`scrub_visitds+0x82
  zfs`dsl_pool_scrub_sync+0x109
  zfs`dsl_pool_sync+0x158
  zfs`spa_sync+0x254
  zfs`txg_sync_thread+0x226
  unix`thread_start+0x8
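
For reference, a DTrace invocation roughly like the following (reconstructed
from memory, so the exact probe/predicate may differ from what I actually
used) produces the trace above:

# dtrace -n '
    fbt:zfs:arc_read_nolock:return
    /arg1 != 0/
    {
            /* arg1 is the return value; 5 == EIO */
            printf("rval %d", arg1);
            stack();
    }'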




Does a zdb -ivv {pool} report any ZIL headers with a claim_txg != 0
on your pools?  Is the dataset that is associated with such a ZIL an
unmounted zfs?

# zdb -ivv files | grep claim_txg
ZIL header: claim_txg 5164405, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 5164405, seq 0
ZIL header: claim_txg 0, seq 0
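
To see which dataset a nonzero claim_txg belongs to, the Dataset line that
zdb prints just before each ZIL header can be pulled out as well (assuming
your zdb -ivv output is ordered like mine):

# zdb -ivv files | egrep 'Dataset|ZIL header'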


# zdb -i files/matrix-usr
Dataset files/matrix-usr [ZPL], ID 216, cr_txg 5091978, 2.39G, 192089 objects

ZIL header: claim_txg 5164405, seq 0

first block: [L0 ZIL intent log] 1000L/1000P DVA[0]=0:12421e:1000 
zilog uncompressed LE contiguous birth=5163908 fill=0 
cksum=c368086f1485f7c4:39a549a81d769386:d8:3

Block seqno 3, already claimed, [L0 ZIL intent log] 1000L/1000P 
DVA[0]=0:12421e:1000 zilog uncompressed LE contiguous birth=5163908 
fill=0 cksum=c368086f1485f7c4:39a549a81d769386:d8:3


On two of my zpools I've eliminated the zpool scrub checksum errors by
mounting / unmounting the zfs with the unplayed ZIL.
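
For anyone who wants to replay this, the workaround amounts to a sequence
like the following (the dataset name is from the zdb output above; substitute
your own, and note I've only done these steps by hand, not as a script):

# zdb -ivv files | grep claim_txg | grep -v 'claim_txg 0'
# zfs mount files/matrix-usr     # mounting claims/replays the unplayed ZIL
# zfs unmount files/matrix-usr
# zpool scrub files              # scrub should now finish without checksum errors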


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-21 Thread Jürgen Keil
Rustam wrote:
> I'm living with this error for almost 4 months and probably have a record
> number of checksum errors:
>
> # zpool status -xv
>   pool: box5
> ...
> errors: Permanent errors have been detected in the
> following files:
>
> <box5>:<0x0>
>
> I've Sol 10 U5 though.

I suspect that this (S10u5) is a different issue, because for my
system's pool the errors seem to have been caused by the OpenSolaris
putback of July 7th for these fixes:

6343667 scrub/resilver has to start over when a snapshot is taken
6343693 'zpool status' gives delayed start for 'zpool scrub'
6670746 scrub on degraded pool return the status of 'resilver completed'?
6675685 DTL entries are lost resulting in checksum errors
6706404 get_history_one() can dereference off end of hist_event_table[]
6715414 assertion failed: ds->ds_owner != tag in dsl_dataset_rele()
6716437 ztest gets SEGV in arc_released()
6722838 bfu does not update grub


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-20 Thread Bill Sommerfeld
On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
>> I ran a scrub on a root pool after upgrading to snv_94, and got checksum
>> errors:
>
> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> on a system that is running post snv_94 bits:  It also found checksum errors
>
> # zpool status files
>   pool: files
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         files         DEGRADED     0     0    18
>           mirror      DEGRADED     0     0    18
>             c8t0d0s6  DEGRADED     0     0    36  too many errors
>             c9t0d0s6  DEGRADED     0     0    36  too many errors
>
> errors: No known data errors

out of curiosity, is this a root pool?  

A second system of mine with a mirrored root pool (and an additional
large multi-raidz pool) shows the same symptoms on the mirrored root
pool only.

once is accident.  twice is coincidence.  three times is enemy
action :-)

I'll file a bug as soon as I can (I'm travelling at the moment with
spotty connectivity), citing my and your reports.

- Bill



Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-20 Thread dick hoogendijk
On Sun, 20 Jul 2008 11:26:16 -0700
Bill Sommerfeld [EMAIL PROTECTED] wrote:

> once is accident.  twice is coincidence.  three times is enemy
> action :-)

I have no access to b94 yet, but as things stand it's probably better
to skip this build when it comes out.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
++ http://nagual.nl/ + SunOS sxce snv91 ++


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-18 Thread Jürgen Keil
> I ran a scrub on a root pool after upgrading to snv_94, and got checksum
> errors:

Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
on a system that is running post snv_94 bits:  It also found checksum errors

# zpool status files
  pool: files
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
config:

        NAME          STATE     READ WRITE CKSUM
        files         DEGRADED     0     0    18
          mirror      DEGRADED     0     0    18
            c8t0d0s6  DEGRADED     0     0    36  too many errors
            c9t0d0s6  DEGRADED     0     0    36  too many errors

errors: No known data errors


Adding the -v option to zpool status returned:


errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>



OTOH, trying to verify checksums with zdb -c didn't find any problems:

# zdb -cvv files

Traversing all blocks to verify checksums and verify nothing leaked ...

No leaks (block sum matches space maps exactly)

bp count:           2804880
bp logical:    121461614592   avg: 43303
bp physical:    84585684992   avg: 30156   compression: 1.44
bp allocated:   85146115584   avg: 30356   compression: 1.43
SPA allocated:  85146115584   used: 79.30%

951.08u 419.55s 2:24:34.32 15.8%
#


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-18 Thread Jürgen Keil
>> I ran a scrub on a root pool after upgrading to snv_94, and got checksum
>> errors:
>
> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> on a system that is running post snv_94 bits:  It also found checksum errors
...
> OTOH, trying to verify checksums with zdb -c didn't
> find any problems:

And a zpool scrub under snv_85 doesn't find checksum errors, either.


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-18 Thread Rustam Aliyev
I'm living with this error for almost 4 months and probably have a record
number of checksum errors:


core# zpool status -xv
  pool: box5
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0   856
          mirror    ONLINE       0     0   428
            c1d0    ONLINE       0     0   856
            c2d0    ONLINE       0     0   856
          mirror    ONLINE       0     0   428
            c2d1    ONLINE       0     0   856
            c1d1    ONLINE       0     0   856

errors: Permanent errors have been detected in the following files:

        <box5>:<0x0>


I've Sol 10 U5 though.

--
Rustam.


Jürgen Keil wrote:
>> I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:
>
> Hmm, after reading this, I started a zpool scrub on my mirrored pool,
> on a system that is running post snv_94 bits:  It also found checksum errors
> ...


[zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-17 Thread Bill Sommerfeld
I ran a scrub on a root pool after upgrading to snv_94, and got checksum
errors:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     2
          mirror      ONLINE       0     0     2
            c4t0d0s0  ONLINE       0     0     4
            c4t1d0s0  ONLINE       0     0     4

I ran it again, and it's now reporting the same errors, but still says
applications are unaffected:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008
config:

        NAME          STATE     READ WRITE CKSUM
        r00t          ONLINE       0     0     4
          mirror      ONLINE       0     0     4
            c4t0d0s0  ONLINE       0     0     8
            c4t1d0s0  ONLINE       0     0     8

errors: No known data errors


I wonder if I'm running into some combination of:

6725341 Running 'zpool scrub' repeatedly on a pool show an ever
increasing error count

and maybe:

6437568 ditto block repair is incorrectly propagated to root vdev

Any way to dig further to determine what's going on?
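
So far the only things I can think of trying (untested guesses on my part)
are letting zdb traverse the pool and verify all checksums itself, and
looking at the raw FMA ereports behind the CKSUM counters:

# zdb -cvv r00t       # walk all blocks, verifying checksums
# fmdump -eV | less   # raw ereports behind the counts above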

- Bill
