Fwd: Re[3]: [zfs-discuss] zpool status and CKSUM errors

2006-07-17 Thread Robert Milkowski
Hi.

   Sorry for the forward, but maybe this will be more visible that way.

   I really think something strange is going on here: it's
   virtually impossible that a genuine hardware problem would produce
   CKSUM errors (many of them) only on ditto blocks.


This is a forwarded message
From: Robert Milkowski <[EMAIL PROTECTED]>
To: Robert Milkowski <[EMAIL PROTECTED]>
Date: Sunday, July 9, 2006, 8:44:16 PM
Subject: [zfs-discuss] zpool status and CKSUM errors

===8<==Original message text===
Hello Robert,

Thursday, July 6, 2006, 1:49:34 AM, you wrote:

RM> Hello Eric,

RM> Monday, June 12, 2006, 11:21:24 PM, you wrote:

ES>> I reproduced this pretty easily on a lab machine.  I've filed:

ES>> 6437568 ditto block repair is incorrectly propagated to root vdev

ES>> to track this issue.  Keep in mind that you do have a flakey
ES>> controller/lun/something.  If this had been a user data block, your data
ES>> would be gone.


RM> I believe that something else is also happening here.
RM> I can see CKSUM errors on two different servers (a v240 and a T2000), all
RM> on non-redundant zpools, and every time it looks like the ditto block
RM> helped - hey, that's just improbable.

RM> And while on the T2000, fmdump -ev gives me:

RM> Jul 05 19:59:43.8786 ereport.io.fire.pec.btp   0x14e4b8015f612002
RM> Jul 05 20:05:28.9165 ereport.io.fire.pec.re    0x14e5f951ce12b002
RM> Jul 05 20:05:58.5381 ereport.io.fire.pec.re    0x14e614e78f4c9002
RM> Jul 05 20:05:58.5389 ereport.io.fire.pec.btp   0x14e614e7b6ddf002
RM> Jul 05 23:34:11.1960 ereport.io.fire.pec.re    0x1513869a6f7a6002
RM> Jul 05 23:34:11.1967 ereport.io.fire.pec.btp   0x1513869a95196002
RM> Jul 06 00:09:17.1845 ereport.io.fire.pec.re    0x151b2fca4c988002
RM> Jul 06 00:09:17.1852 ereport.io.fire.pec.btp   0x151b2fca72e6b002


RM> on the v240, fmdump shows nothing for over a month, and I'm sure I ran
RM> zpool clear on that server more recently than that.


RM> v240:
RM> bash-3.00# zpool status nfs-s5-s7
RM>   pool: nfs-s5-s7
RM>  state: ONLINE
RM> status: One or more devices has experienced an unrecoverable error.  An
RM>         attempt was made to correct the error.  Applications are unaffected.
RM> action: Determine if the device needs to be replaced, and clear the errors
RM>         using 'zpool clear' or replace the device with 'zpool replace'.
RM>    see: http://www.sun.com/msg/ZFS-8000-9P
RM>  scrub: none requested
RM> config:

RM> NAME                             STATE   READ WRITE CKSUM
RM> nfs-s5-s7                        ONLINE     0     0   167
RM>   c4t600C0FF009258F28706F5201d0  ONLINE     0     0   167

RM> errors: No known data errors
RM> bash-3.00#
RM> bash-3.00# zpool clear nfs-s5-s7
RM> bash-3.00# zpool status nfs-s5-s7
RM>   pool: nfs-s5-s7
RM>  state: ONLINE
RM>  scrub: none requested
RM> config:

RM> NAME                             STATE   READ WRITE CKSUM
RM> nfs-s5-s7                        ONLINE     0     0     0
RM>   c4t600C0FF009258F28706F5201d0  ONLINE     0     0     0

RM> errors: No known data errors
RM> bash-3.00#
RM> bash-3.00# zpool scrub nfs-s5-s7
RM> bash-3.00# zpool status nfs-s5-s7
RM>   pool: nfs-s5-s7
RM>  state: ONLINE
RM>  scrub: scrub in progress, 0.01% done, 269h24m to go
RM> config:

RM> NAME                             STATE   READ WRITE CKSUM
RM> nfs-s5-s7                        ONLINE     0     0     0
RM>   c4t600C0FF009258F28706F5201d0  ONLINE     0     0     0

RM> errors: No known data errors
RM> bash-3.00#

RM> We'll see the result - I hope I won't have to stop it in the
RM> morning. Anyway, I have a feeling that nothing will be reported.


RM> PS. I've got several similar pools on those two servers and I see
RM> CKSUM errors on all of them, always with the same result - that's
RM> almost impossible.
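[Not part of the original thread: for anyone wanting to repeat the check across several pools, here is a sketch that pulls nonzero CKSUM counts out of `zpool status`-style output. It assumes the table layout shown in this thread (NAME STATE READ WRITE CKSUM) and is fed captured text so it can be tried without a live pool; on a real host you would pipe `zpool status` into it.]

```shell
# Sketch: print every vdev with a nonzero CKSUM count from
# `zpool status`-style output on stdin.
cksum_errors() {
  awk '
    $1 == "NAME" && $5 == "CKSUM" { in_table = 1; next }  # start of config table
    in_table && NF == 0           { in_table = 0 }        # blank line ends table
    in_table && $5 + 0 > 0        { print $1, $5 }        # nonzero CKSUM column
  '
}

# Example with the v240 output quoted above:
cksum_errors <<'EOF'
NAME                             STATE   READ WRITE CKSUM
nfs-s5-s7                        ONLINE     0     0   167
  c4t600C0FF009258F28706F5201d0  ONLINE     0     0   167

errors: No known data errors
EOF
```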


OK, it actually took several days for the scrub to complete.
I already saw some CKSUM errors during the scrub, and now there are
many of them again; however, the scrub itself reported no errors at all.

bash-3.00# zpool status nfs-s5-s7
  pool: nfs-s5-s7
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Sun Jul  9 02:56:19 2006
config:

NAME                             STATE   READ WRITE CKSUM
nfs-s5-s7                        ONLINE     0     0    18
  c4t600C0FF009258F28706F5201d0  ONLINE     0     0    18

errors: No known data errors
bash-3.00#


-- 
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss