Re: [zfs-discuss] Many checksum errors during resilver.

2010-06-21 Thread Cindy Swearingen

Hi Justin,

This looks like an older Solaris 10 release. If so, you are probably hitting
a zpool status display bug, where the checksum errors appear to be charged to
the replacement device but are not actually occurring there.

I would review the steps described in the hardware section of the ZFS
troubleshooting wiki to confirm that the new disk is working as
expected:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide
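
For example, to get a quick look at the disk's error counters and any FMA
error reports that mention it (the device name is taken from your zpool
status output):

# iostat -En c1t6d0     (per-device soft/hard/transport error counters)
# fmdump -eV            (recent FMA error reports, if any name this device)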

Then follow the steps in the "Notify FMA That Device Replacement is Complete"
section to reset FMA, and start monitoring the replacement device with fmdump
to see whether any new error activity is logged against it.
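
That section of the guide boils down to something like the following; the
fault UUID, if one is listed, is specific to your system:

# fmadm faulty                  (note the UUID of the ZFS fault entry, if any)
# fmadm repair <UUID>           (tell FMA the faulted device has been dealt with)
# fmadm reset zfs-diagnosis     (restart the ZFS diagnosis engine)
# fmadm reset zfs-retire        (restart the ZFS retire agent)
# fmdump -eV                    (afterwards, watch for new ereports against the new disk)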

Thanks,

Cindy


On 06/21/10 10:21, Justin Daniel Meyer wrote:

I've decided to upgrade my home server's capacity by replacing the disks in one
of my mirror vdevs.  The procedure appeared to work out, but during the resilver
a couple million checksum errors were logged against the new device.  I've read
through quite a bit of the archive and searched around, but cannot find anything
definitive to ease my mind about whether to proceed.



[zfs-discuss] Many checksum errors during resilver.

2010-06-21 Thread Justin Daniel Meyer
I've decided to upgrade my home server's capacity by replacing the disks in one
of my mirror vdevs.  The procedure appeared to work out, but during the resilver
a couple million checksum errors were logged against the new device.  I've read
through quite a bit of the archive and searched around, but cannot find anything
definitive to ease my mind about whether to proceed.
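
(For reference, the temporary "replacing" vdev visible in the status output
below is what zpool replace creates; with the new disk in the same physical
slot, the invocation is just the one-argument form:)

# zpool replace tank c1t6d0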


SunOS deepthought 5.10 Generic_142901-13 i86pc i386 i86pc

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 691h28m to go
config:

NAME              STATE     READ WRITE CKSUM
tank              DEGRADED     0     0     0
  mirror          DEGRADED     0     0     0
    replacing     DEGRADED   215     0     0
      c1t6d0s0/o  FAULTED      0     0     0  corrupted data
      c1t6d0      ONLINE       0     0   215  3.73M resilvered
    c1t2d0        ONLINE       0     0     0
  mirror          ONLINE       0     0     0
    c1t1d0        ONLINE       0     0     0
    c1t5d0        ONLINE       0     0     0
  mirror          ONLINE       0     0     0
    c1t0d0        ONLINE       0     0     0
    c1t4d0        ONLINE       0     0     0
logs
  c8t1d0p1        ONLINE       0     0     0
cache
  c2t1d0p2        ONLINE       0     0     0


During the resilver, the cache device and the log (ZIL) device were both removed
due to errors (1-2k each).  (Despite the c2/c8 discrepancy, they are partitions
on the same OCZ Vertex II device.)


# zpool status -xv tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 9h20m with 0 errors on Sat Jun 19 22:07:27 2010
config:

NAME        STATE     READ WRITE CKSUM
tank        DEGRADED     0     0     0
  mirror    ONLINE       0     0     0
    c1t6d0  ONLINE       0     0 2.69M  539G resilvered
    c1t2d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t1d0  ONLINE       0     0     0
    c1t5d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t0d0  ONLINE       0     0     0
    c1t4d0  ONLINE       0     0     0
logs
  c8t1d0p1  REMOVED      0     0     0
cache
  c2t1d0p2  REMOVED      0     0     0

I cleared the errors (about 5,000 per GB resilvered!), removed the cache device,
and replaced the ZIL partition with the whole device.  After three clean pool
scrubs, I'd like a second opinion on whether it looks safe to replace the second
drive in this mirror vdev.  The one thing I have not tried is a large file
transfer to the server, as I am also dealing with an NFS mount problem which
popped up suspiciously close to my most recent patch update.
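
(Roughly, that sequence maps onto the commands below; device names come from
the status output above, and the last command, with a placeholder for the new
disk, is the step I have not run yet:)

# zpool clear tank                      (reset the error counters)
# zpool remove tank c2t1d0p2            (drop the cache device)
# zpool replace tank c8t1d0p1 c0t0d0    (swap the log partition for the whole device)
# zpool scrub tank                      (repeated three times, all clean)
# zpool replace tank c1t2d0 <new_disk>  (the step in question)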


# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: scrub completed after 3h26m with 0 errors on Mon Jun 21 01:45:00 2010
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t6d0  ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t1d0  ONLINE       0     0     0
    c1t5d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t0d0  ONLINE       0     0     0
    c1t4d0  ONLINE       0     0     0
logs
  c0t0d0    ONLINE       0     0     0

errors: No known data errors


/var/adm/messages is positively overrun with these triplets/quadruplets, not
all of which end up as the "fatal" type.


Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8...@7/d...@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)               Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]   Requested Block: 26721062   Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]   Vendor: ATA   Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]   Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]   ASC: 0x8 (LUN communication failure), ASCQ: 0x0, FRU: 0x0
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci1043,8.