Re: [zfs-discuss] zpool CKSUM errors since drive replace

2008-10-28 Thread Matthew Angelo
Another update:

Last night, after reading many blog posts about si3124 chipset problems with
Solaris 10, I applied Patch Id 138053-02, which updates the si3124 driver from
1.2 to 1.4 and fixes numerous performance and interrupt-related bugs.

And it appears to have helped.  Below is the zpool status after a scrub with
the new driver, but I'm still not confident about the exact cause.
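
For anyone following along, the patch and the new driver revision can be
checked after a reboot with something like the following (standard Solaris 10
commands; exact output will vary):

# patchadd -p | grep 138053
# modinfo | grep si3124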

# zpool status -v
  pool: rzdata
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Wed Oct 29 05:32:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        rzdata      ONLINE       0     0     2
          raidz1    ONLINE       0     0     2
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     3
            c4t3d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

/rzdata/downloads/linux/ubuntu-8.04.1-desktop-i386.iso

It still didn't clear the errored file, which I'm curious about considering
it's a RAIDZ and the data should have been reconstructible.
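
In case it helps anyone else hitting this: once a file is listed under
Permanent errors, the entry tends to stay until the bad blocks are no longer
referenced.  The usual sequence (assuming the file is replaceable, as an ISO
is) is to delete or restore it, clear the error counters, and scrub again,
roughly:

# rm /rzdata/downloads/linux/ubuntu-8.04.1-desktop-i386.iso
# zpool clear rzdata
# zpool scrub rzdata
# zpool status -v rzdata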

On Mon, Oct 27, 2008 at 2:57 PM, Matthew Angelo [EMAIL PROTECTED] wrote:

 Another update.

 The weekly cron scrub kicked in again this week, but this time it failed with
 a lot of CKSUM errors and also complained about a corrupted file.  The single
 file it complained about is a new one I recently copied into the pool.

 I'm stumped by this.  How do I verify the x86 hardware from within the OS?

 I've run Memtest86 and it ran overnight without a problem.  Tonight I will
 be moving back to my old motherboard/CPU/memory.  Hopefully this is a simple
 hardware problem.

 But the question I'd like to pose to everyone is, how can we validate our
 x86 hardware?
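
 For the disk/controller side at least (memtest86 covers the RAM), Solaris
 keeps per-device error counters and FMA telemetry that can be checked from
 the running OS, for example:

 # iostat -En              (soft/hard/transport error counts per device)
 # fmdump -eV | tail -50   (recent fault management error events, if any)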


 On Tue, Oct 21, 2008 at 8:23 AM, David Turnbull [EMAIL PROTECTED] wrote:

 I don't think it's normal, no.  It seems to occur when the resilver is
 interrupted and gets marked as done prematurely?


 On 20/10/2008, at 12:28 PM, Matthew Angelo wrote:

  Hi David,

 Thanks for the additional input.   This is the reason why I thought I'd
 start a thread about it.

 To continue my original topic, I have additional information to add.
 After last week's initial replace/resilver/scrub, my weekly cron scrub
 (it runs Sunday morning) kicked off and all CKSUM errors have now cleared:


  pool: rzdata
  state: ONLINE
  scrub: scrub completed with 0 errors on Mon Oct 20 09:41:31 2008
 config:

         NAME        STATE     READ WRITE CKSUM
         rzdata      ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c3t0d0  ONLINE       0     0     0
             c3t1d0  ONLINE       0     0     0
             c3t2d0  ONLINE       0     0     0
             c3t3d0  ONLINE       0     0     0
             c4t0d0  ONLINE       0     0     0
             c4t1d0  ONLINE       0     0     0
             c4t2d0  ONLINE       0     0     0
             c4t3d0  ONLINE       0     0     0

 errors: No known data errors
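
 For completeness, the weekly scrub mentioned above is just a root crontab
 entry along these lines (the Sunday 4am schedule is a guess; only the pool
 name comes from this thread):

 # crontab -l | grep scrub
 0 4 * * 0 /usr/sbin/zpool scrub rzdata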


 Which leads me to ask: is it normal to see high checksum (CKSUM) errors on a
 zpool after replacing a failed disk, once it has resilvered?

 Is there anything I can feed back to the zfs community on this matter?

 Matt

 On Sun, Oct 19, 2008 at 9:26 AM, David Turnbull [EMAIL PROTECTED]
 wrote:
 Hi Matthew.

 I had a similar problem last week.  One disk in the raidz had its first 4GB
 zeroed out (manually); we then offlined it and replaced it with a new disk.
 High checksum errors were occurring on the partially-zeroed disk, as you'd
 expect, but when the new disk was inserted, checksum errors occurred on all
 disks.
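
 For reference, the offline/replace step described above would presumably
 have been the standard sequence, something along these lines (pool and
 device names are placeholders, not Dave's actual ones):

 # zpool offline tank c1t2d0
   (swap in the new disk on the same port)
 # zpool replace tank c1t2d0
 # zpool status -x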

 Not sure how relevant this is to your particular situation, but unexpected
 checksum errors on known-good hardware have definitely happened to me as
 well.

 -- Dave


 On 15/10/2008, at 10:50 PM, Matthew Angelo wrote:

 The original disk failure was very explicit: high read errors and
 corresponding errors in /var/adm/messages.

 When I replaced the disk, however, those all went away and the resilver was
 okay.  I am not seeing any read/write errors or /var/adm/messages entries --
 but for some reason I am seeing errors in the CKSUM column, which I've never
 seen before.

 I hope you're right and it's a simple memory corruption problem.  I will
 be running memtest86 overnight; hopefully it fails so we can rule out
 zfs.


 On Wed, Oct 15, 2008 at 11:48 AM, Mark J Musante [EMAIL PROTECTED]
 wrote:
  So this is where I stand.  I'd like to ask zfs-discuss if they've seen
 any ZIL/Replay style bugs associated with u3/u5 x86?  Again, I'm confident
 in my 

Re: [zfs-discuss] zpool CKSUM errors since drive replace

2008-10-15 Thread Matthew Angelo
The original disk failure was very explicit: high read errors and
corresponding errors in /var/adm/messages.

When I replaced the disk, however, those all went away and the resilver was
okay.  I am not seeing any read/write errors or /var/adm/messages entries --
but for some reason I am seeing errors in the CKSUM column, which I've never
seen before.

I hope you're right and it's a simple memory corruption problem.  I will be
running memtest86 overnight; hopefully it fails so we can rule out zfs.


On Wed, Oct 15, 2008 at 11:48 AM, Mark J Musante [EMAIL PROTECTED] wrote:

  So this is where I stand.  I'd like to ask zfs-discuss if they've seen
 any ZIL/Replay style bugs associated with u3/u5 x86?  Again, I'm confident
 in my hardware, and /var/adm/messages is showing no warnings/errors.

 Are you absolutely sure the hardware is OK?  Is there another disk you can
 test in its place?  If I read your post correctly, your first disk was
 having errors logged against it, and now the second disk -- plugged into the
 same port -- is also logging errors.

 This seems to me more like the port is bad.  Is there a third disk you can
 try in that same port?

 I have a hard time seeing that this could be a zfs bug - I've been doing
 lots of testing on u5 and the only time I see checksum errors is when I
 deliberately induce them.



Re: [zfs-discuss] zpool CKSUM errors since drive replace

2008-10-14 Thread Mark J Musante
 So this is where I stand.  I'd like to ask zfs-discuss if they've seen any 
 ZIL/Replay style bugs associated with u3/u5 x86?  Again, I'm confident in my 
 hardware, and /var/adm/messages is showing no warnings/errors.

Are you absolutely sure the hardware is OK?  Is there another disk you can test 
in its place?  If I read your post correctly, your first disk was having errors 
logged against it, and now the second disk -- plugged into the same port -- is 
also logging errors.

This seems to me more like the port is bad.  Is there a third disk you can try 
in that same port?

I have a hard time seeing that this could be a zfs bug - I've been doing lots 
of testing on u5 and the only time I see checksum errors is when I deliberately 
induce them.
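
For anyone who wants to see what deliberately induced checksum errors look
like without touching a real pool, a throwaway file-backed mirror is enough
(paths, sizes and pool name here are arbitrary).  Corrupt one backing file
past the front labels, then scrub:

# mkfile 256m /var/tmp/v1 /var/tmp/v2
# zpool create testpool mirror /var/tmp/v1 /var/tmp/v2
# dd if=/dev/urandom of=/testpool/junk bs=64k count=512 ; sync
# dd if=/dev/urandom of=/var/tmp/v2 bs=64k oseek=64 count=1024 conv=notrunc
# zpool scrub testpool
# zpool status -v testpool      (CKSUM errors should show against /var/tmp/v2)
# zpool destroy testpool ; rm /var/tmp/v1 /var/tmp/v2
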
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss