Re: [zfs-discuss] Checksum errors with SSD.

2010-07-02 Thread Benjamin Grogg
Dear Cindy and Edward

Many thanks for your input. Indeed there is something wrong with the SSD.
Smartmontools also confirms a couple of errors.
So I opened a case and hopefully they will replace the SSD. What did I learn?
- Be careful of special offers
- Use rock-solid components even for your home server
- Use ZFS and scrub regularly (a cron sketch follows below)
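For what it's worth, here is a minimal sketch of scheduling that regular scrub
from cron (the pool name and schedule are just examples, adjust to taste):

# root crontab entry: scrub rpool every Sunday at 03:00
# (edit the crontab with: crontab -e)
0 3 * * 0 /usr/sbin/zpool scrub rpool
# check the result later with:
# /usr/sbin/zpool status -v rpool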

Best regards and many thanks for all your help and keep up the good work!
Benjamin


[zfs-discuss] Checksum errors with SSD.

2010-07-01 Thread Benjamin Grogg
Dear Forum

I use a KINGSTON SNV125-S2/30GB SSD on an ASUS M3A78-CM motherboard (AMD SB700
chipset).
SATA Type (in BIOS) is SATA
OS: SunOS homesvr 5.11 snv_134 i86pc i386 i86pc

When I scrub my pool I get a lot of checksum errors:

NAME        STATE     READ WRITE CKSUM
rpool       DEGRADED     0     0     5
  c8d0s0    DEGRADED     0     0    71  too many errors

zpool clear rpool works, but after a scrub I end up in the same situation again.
fmstat looks like this:

module ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire0   0  0.00.0   0   0 0 0  0  0
disk-transport   0   0  0.0 1541.1   0   0 0 032b  0
eft  1   0  0.04.7   0   0 0 0   1.2M  0
ext-event-transport   3   0  0.02.1   0   0 0 0  0  0
fabric-xlate 0   0  0.00.0   0   0 0 0  0  0
fmd-self-diagnosis   6   0  0.00.0   0   0 0 0  0  0
io-retire0   0  0.00.0   0   0 0 0  0  0
sensor-transport 0   0  0.0   37.3   0   0 0 032b  0
snmp-trapgen 3   0  0.01.1   0   0 0 0  0  0
sysevent-transport   0   0  0.0 2836.3   0   0 0 0  0  0
syslog-msgs  3   0  0.02.7   0   0 0 0  0  0
zfs-diagnosis   91  77  0.0   28.9   0   0 2 1   336b   280b
zfs-retire  10   0  0.0  387.9   0   0 0 0   620b  0

fmadm looks like this:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jun 30 16:37:28 806072e5-7cd6-efc1-c89d-d40bce4adf72  ZFS-8000-GH    Major

Host: homesvr
Platform: System-Product-Name   Chassis_id  : System-Serial-Number
Product_sn  : 

Fault class : fault.fs.zfs.vdev.checksum
Affects : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service
Problem in  : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service

In /var/adm/messages I don't see anything abnormal.
I also tried the SSD on another SATA port, but without success.

My other HDDs run smoothly:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    c4d1    ONLINE       0     0     0
    c5d0    ONLINE       0     0     0

iostat gives me the following:

c4d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: WDC WD10EVDS-63 Revision:  Serial No:  WD-WCAV592 Size: 1000.20GB 
1000202305536 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c5d0 Soft Errors: 981 Hard Errors: 0 Transport Errors: 981 
Model: Hitachi HDS7210 Revision:  Serial No:   JP2921HQ0 Size: 1000.20GB 
1000202305536 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: KINGSTON SSDNOW Revision:  Serial No: 30PM10I Size: 30.02GB 
30016659456 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 

Any hints?
Best regards and many thanks for your help!

Benjamin


Re: [zfs-discuss] Checksum errors with SSD.

2010-07-01 Thread Cindy Swearingen

Hi Benjamin,

I'm not familiar with this disk, but you can see from the fmstat output that
the disk, system-event, and ZFS-related diagnosis engines are working overtime
on something, and it's probably this disk.

You can get further details from fmdump -eV, where you will probably
see lots of checksum errors on this disk.
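For example, something like this (a rough sketch; the exact ereport class
names can vary a bit between builds):

# list the most recent error reports
fmdump -e | tail -20
# dump the full ereports and look for the ereport.fs.zfs.checksum entries
fmdump -eV | more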

You might review some of the h/w diagnostic recommendations in this wiki:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

I would recommend replacing the disk soon, or figuring out what other
issue might be causing problems for it.

Thanks,

Cindy
Benjamin Grogg wrote:

Dear Forum

I use a KINGSTON SNV125-S2/30GB SSD on an ASUS M3A78-CM motherboard (AMD SB700
chipset).
SATA Type (in BIOS) is SATA
OS: SunOS homesvr 5.11 snv_134 i86pc i386 i86pc


When I scrub my pool I get a lot of checksum errors:

NAME        STATE     READ WRITE CKSUM
rpool       DEGRADED     0     0     5
  c8d0s0    DEGRADED     0     0    71  too many errors

zpool clear rpool works, but after a scrub I end up in the same situation again.
fmstat looks like this:

module ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire0   0  0.00.0   0   0 0 0  0  0
disk-transport   0   0  0.0 1541.1   0   0 0 032b  0
eft  1   0  0.04.7   0   0 0 0   1.2M  0
ext-event-transport   3   0  0.02.1   0   0 0 0  0  0
fabric-xlate 0   0  0.00.0   0   0 0 0  0  0
fmd-self-diagnosis   6   0  0.00.0   0   0 0 0  0  0
io-retire0   0  0.00.0   0   0 0 0  0  0
sensor-transport 0   0  0.0   37.3   0   0 0 032b  0
snmp-trapgen 3   0  0.01.1   0   0 0 0  0  0
sysevent-transport   0   0  0.0 2836.3   0   0 0 0  0  0
syslog-msgs  3   0  0.02.7   0   0 0 0  0  0
zfs-diagnosis   91  77  0.0   28.9   0   0 2 1   336b   280b
zfs-retire  10   0  0.0  387.9   0   0 0 0   620b  0

fmadm looks like this :

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jun 30 16:37:28 806072e5-7cd6-efc1-c89d-d40bce4adf72  ZFS-8000-GH    Major


Host: homesvr
Platform: System-Product-Name   Chassis_id  : System-Serial-Number
Product_sn  : 


Fault class : fault.fs.zfs.vdev.checksum
Affects : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service
Problem in  : zfs://pool=rpool/vdev=f7dad7554a72b3bc
  faulted but still in service

In /var/adm/messages I don't see anything abnormal.
I also tried the SSD on another SATA port, but without success.

My other HDDs run smoothly:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    c4d1    ONLINE       0     0     0
    c5d0    ONLINE       0     0     0

iostat gives me the following:

c4d1 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: WDC WD10EVDS-63 Revision:  Serial No:  WD-WCAV592 Size: 1000.20GB 1000202305536 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c5d0 Soft Errors: 981 Hard Errors: 0 Transport Errors: 981 
Model: Hitachi HDS7210 Revision:  Serial No:   JP2921HQ0 Size: 1000.20GB 1000202305536 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 
c8d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Model: KINGSTON SSDNOW Revision:  Serial No: 30PM10I Size: 30.02GB 30016659456 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 


Any hints?
Best regards and many thanks for your help!

Benjamin



Re: [zfs-discuss] Checksum errors with SSD.

2010-07-01 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Benjamin Grogg
 
 When I scrub my pool I got a lot of checksum errors :
 
 NAMESTATE READ WRITE CKSUM
 rpool   DEGRADED 0 0 5
   c8d0s0DEGRADED 0 071  too many errors
 
 Any hints?

What's the confusion?  Replace the drive.

If you think it's a false positive (the drive is not actually failing), then you
would zpool clear (or online, or whatever, until the pool looks normal
again) and then scrub.  If the errors come back, it definitely means something
is failing: the drive, or perhaps the SATA cable that connects to it, or perhaps
the controller.  But it's 99% certain to be the drive.
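In other words, roughly this (a sketch only; substitute your actual pool and
device names):

# reset the error counters, scrub again, then re-check the counters
zpool clear rpool c8d0s0
zpool scrub rpool
# ...wait for the scrub to finish, then:
zpool status -v rpool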



[zfs-discuss] Checksum errors on and after resilver

2010-04-14 Thread bonso
Hi all,
 I recently experienced a disk failure on my home server and observed checksum
errors while resilvering the pool and on the first scrub after the resilver had
completed. Now everything seems fine, but I'm posting this to get help with
calming my nerves and detecting any possible future faults.

 Let's start with some specs.
OSOL 2009.06
Intel SASUC8i (w LSI 1.30IT FW)
Gigabyte MA770-UD3 mobo w 8GB ECC RAM
Hitachi P7K500 harddrives

 When checking the condition of my pool some days ago (yes, I should make it
mail me if something like this happens again), one disk in the pool was labeled
as Removed with a small number of read errors, nineish I think; all other
disks were fine. I removed and tested the disk (DFT crashed, so it seemed very
broken), replaced the drive and started a resilver.

 Checking the status of the resilver, everything looked good from the start, but
when it was finished the status report looked like this:
  pool: sasuc8i
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 4h9m with 0 errors on Mon Apr 12 18:12:26 2010
config:

NAME STATE READ WRITE CKSUM
sasuc8i  ONLINE   0 0 0
  raidz2 ONLINE   0 0 0
c12t4d0  ONLINE   0 0 5  108K resilvered
c12t8d0  ONLINE   0 0 0  254G resilvered
c12t6d0  ONLINE   0 0 0
c12t7d0  ONLINE   0 0 0
c12t0d0  ONLINE   0 0 1  21.5K resilvered
c12t1d0  ONLINE   0 0 2  43K resilvered
c12t2d0  ONLINE   0 0 4  86K resilvered
c12t3d0  ONLINE   0 0 1  21.5K resilvered

errors: No known data errors

 All I really cared about at this point was the "Applications are unaffected"
and "No known data errors", and I thought that the checksum errors might be down
to the failing drive (c12t5d0 failed, the controller labeled the new drive as
c12t8d0) going out during a write. Then again, ZFS is atomic, so better to clear
the errors and run a scrub. It came out like this:
  pool: sasuc8i
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 1h16m with 0 errors on Tue Apr 13 01:29:32 2010
config:

NAME STATE READ WRITE CKSUM
sasuc8i  ONLINE   0 0 0
  raidz2 ONLINE   0 0 0
c12t4d0  ONLINE   0 0 5
c12t8d0  ONLINE   0 0 0
c12t6d0  ONLINE   0 0 0
c12t7d0  ONLINE   0 0 4  86K repaired
c12t0d0  ONLINE   0 0 1
c12t1d0  ONLINE   0 0 6  86K repaired
c12t2d0  ONLINE   0 0 4
c12t3d0  ONLINE   0 0 6  108K repaired

errors: No known data errors

 Now I'm getting nervous. Checksum errors, some repaired, others not. Am I going
to end up with multiple drive failures, or what the * is going on here?

 Ran one more scrub and everything came up roses.
 Checked SMART status on the drives with checksum errors and they are fine,
although I expect only read/write errors would show up there.

 I'm not sure how to turn this into a proper question, but what I'm after is:
is this normal and to be expected after a resilver, and can I start breathing
again? Checksum errors are, as far as I can gather, dodgy data on disk, while
read/write errors point at something in the physical link (more or less).

Thank you!


Re: [zfs-discuss] Checksum errors on and after resilver

2010-04-14 Thread Richard Elling
[this seems to be the question of the day, today...]

On Apr 14, 2010, at 2:57 AM, bonso wrote:

 Hi all,
 I recently experienced a disk failure on my home server and observed checksum
 errors while resilvering the pool and on the first scrub after the resilver
 had completed. Now everything seems fine, but I'm posting this to get help
 with calming my nerves and detecting any possible future faults.
 
 Let's start with some specs.
 OSOL 2009.06
 Intel SASUC8i (w LSI 1.30IT FW)
 Gigabyte MA770-UD3 mobo w 8GB ECC RAM
 Hitachi P7K500 harddrives
 
 When checking the condition of my pool some days ago (yes, I should make it
 mail me if something like this happens again), one disk in the pool was labeled
 as Removed with a small number of read errors, nineish I think; all other
 disks were fine. I removed and tested the disk (DFT crashed, so it seemed very
 broken), replaced the drive and started a resilver.
 
 Checking the status of the resilver everything looked good from the start but 
 when it was finished the status report looked like this:
  pool: sasuc8i
 state: ONLINE
 status: One or more devices has experienced an unrecoverable error.  An
   attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
   using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 4h9m with 0 errors on Mon Apr 12 18:12:26 2010
 config:
 
   NAME STATE READ WRITE CKSUM
   sasuc8i  ONLINE   0 0 0
 raidz2 ONLINE   0 0 0
   c12t4d0  ONLINE   0 0 5  108K resilvered
   c12t8d0  ONLINE   0 0 0  254G resilvered
   c12t6d0  ONLINE   0 0 0
   c12t7d0  ONLINE   0 0 0
   c12t0d0  ONLINE   0 0 1  21.5K resilvered
   c12t1d0  ONLINE   0 0 2  43K resilvered
   c12t2d0  ONLINE   0 0 4  86K resilvered
   c12t3d0  ONLINE   0 0 1  21.5K resilvered
 
 errors: No known data errors
 
 All I really cared about at this point was the "Applications are unaffected"
 and "No known data errors", and I thought that the checksum errors might be
 down to the failing drive (c12t5d0 failed, the controller labeled the new
 drive as c12t8d0) going out during a write. Then again, ZFS is atomic, so better
 to clear the errors and run a scrub. It came out like this:
  pool: sasuc8i
 state: ONLINE
 status: One or more devices has experienced an unrecoverable error.  An
   attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
   using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 1h16m with 0 errors on Tue Apr 13 01:29:32 2010
 config:
 
   NAME STATE READ WRITE CKSUM
   sasuc8i  ONLINE   0 0 0
 raidz2 ONLINE   0 0 0
   c12t4d0  ONLINE   0 0 5
   c12t8d0  ONLINE   0 0 0
   c12t6d0  ONLINE   0 0 0
   c12t7d0  ONLINE   0 0 4  86K repaired
   c12t0d0  ONLINE   0 0 1
   c12t1d0  ONLINE   0 0 6  86K repaired
   c12t2d0  ONLINE   0 0 4
   c12t3d0  ONLINE   0 0 6  108K repaired
 
 errors: No known data errors
 
 Now I'm getting nervous. Checksum errors, some repaired others not. Am I 
 going to end up with multiple drive failures or what the * is going on here?

When I see many disks suddenly reporting errors, I suspect a common
element: HBA, cables, backplane, mobo, CPU, power supply, etc.

If you search the zfs-discuss archives you can find instances where
HBA firmware, driver issues, or firmware+driver interactions caused
such reports. Cabling and power supplies are less commonly reported.

 Ran one more scrub and everything came up roses.
 Checked SMART status on the drives with checksum errors and they are fine,
 although I expect only read/write errors would show up there.
 
 I'm not sure how to turn this into a proper question, but what I'm after is:
 is this normal and to be expected after a resilver, and can I start breathing
 again? Checksum errors are, as far as I can gather, dodgy data on disk, while
 read/write errors point at something in the physical link (more or less).

Breathing is good.  Then check your firmware releases.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 







[zfs-discuss] checksum errors increasing on spare vdev?

2010-03-17 Thread Eric Sproul
Hi,
One of my colleagues was confused by the output of 'zpool status' on a pool
where a hot spare is being resilvered in after a drive failure:

$ zpool status data
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h56m, 23.78% done, 3h1m to go
config:

NAME            STATE     READ WRITE CKSUM
data            DEGRADED     0     0     0
  raidz1        ONLINE       0     0     0
    c0t2d0      ONLINE       0     0     0
    c1t2d0      ONLINE       0     0     0
    c0t4d0      ONLINE       0     0     0
    c0t5d0      ONLINE       0     0     0
    c1t4d0      ONLINE       0     0     0
    c0t7d0      ONLINE       0     0     0
  raidz1        DEGRADED     0     0     0
    spare       DEGRADED     0     0 2.89M
      c0t1d0    REMOVED      0     0     0
      c0t6d0    ONLINE       0     0     0  59.3G resilvered
    c1t5d0      ONLINE       0     0     0
    c0t3d0      ONLINE       0     0     0
    c1t1d0      ONLINE       0     0     0
    c1t3d0      ONLINE       0     0     0
    c1t6d0      ONLINE       0     0     0
spares
  c0t6d0        INUSE     currently in use

The CKSUM error count is increasing, so he thought that the spare was also
failing. I disagreed, because the errors were being recorded against the fake
"spare" vdev rather than the disks, but I want to make sure my hunch is correct.

My hunch is that since reads from userland continue to come to the pool, and
since it's raidz, some of those reads will be for zobject addresses on the
failed drive, now represented by the spare.  Because the data at those addresses
is uninitialized, we get checksum errors.

I guess I really have two questions:
1. Am I correct about the source of the checksum errors attributed to the
spare vdev?
2. During raidz resilver, if a read happens for an address that is among what's
already been resilvered, will that read succeed, or will ALL reads to that
top-level vdev require reconstruction from the other leaf vdevs?

If the answer to #2 is that reads will succeed if they ask for data that's been
resilvered, then I might expect my read performance to increase as resilver
progresses, as less and less data requires reconstruction.  I haven't measured
this in a controlled environment though, so I'm mostly just curious about the
theory.
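If anyone does want to eyeball it, a crude sketch would be to sample per-vdev
throughput and disk service times while the resilver runs:

# per-vdev I/O statistics every 10 seconds during the resilver
zpool iostat -v data 10
# service times and utilization of the underlying disks
iostat -xn 10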

Eric


[zfs-discuss] Checksum errors

2009-06-17 Thread UNIX admin
pool: space01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.48% done, 4h18m to go
config:

NAME STATE READ WRITE CKSUM
space01  ONLINE   0 0 0
  raidz  ONLINE   0 0 0
c0t1d0   ONLINE   0 0 0
c0t2d0   ONLINE   0 0 0
c0t3d0   ONLINE   0 0 0
  raidz  ONLINE   0 0 0
c1t9d0   ONLINE   0 0 0
c1t10d0  ONLINE   0 0 0
c1t11d0  ONLINE   0 0 2

errors: No known data errors
The last drive shows two checksum errors, but iostat(1M) shows no hardware 
errors on that disk:

iostat -Ene | grep Hard | grep c1t11d0
c1t11d0  Soft Errors: 178 Hard Errors: 0 Transport Errors: 0

I'm not sure what I need to do, or how else I can determine whether the
device needs to be replaced.
Do I perform a zpool clear, do I need to replace c1t11d0, or do I rerun the scrub?


Re: [zfs-discuss] Checksum errors

2009-06-17 Thread Cindy . Swearingen

Hi UNIX admin,

I would check fmdump -eV output to see if this error is isolated or
persistent.

If fmdump says this error is isolated, then you might just monitor the
status. For example, if fmdump says that these errors occurred on 6/15
and you moved this system on that date or you know that someone
shouted at c1t11d0 on that date, then those events might explain this
issue and you can use zpool clear to clear the error state.

If fmdump says the c1t11d0 error persists over a period of time, then I
would consider replacing this device.
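A rough sketch of that check (the grep patterns are only examples and the
payload field names can differ slightly between releases):

# timestamps of the checksum ereports: isolated burst or ongoing?
fmdump -e | grep checksum
# full payloads, including which vdev each ereport names
fmdump -eV | egrep 'class|vdev' | more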

You can review more diagnostic tips here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Resolving_Hardware_Problems

Cindy

UNIX admin wrote:

pool: space01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.48% done, 4h18m to go
config:

NAME STATE READ WRITE CKSUM
space01  ONLINE   0 0 0
  raidz  ONLINE   0 0 0
c0t1d0   ONLINE   0 0 0
c0t2d0   ONLINE   0 0 0
c0t3d0   ONLINE   0 0 0
  raidz  ONLINE   0 0 0
c1t9d0   ONLINE   0 0 0
c1t10d0  ONLINE   0 0 0
c1t11d0  ONLINE   0 0 2

errors: No known data errors
The last drive shows two checksum errors, but iostat(1M) shows no hardware 
errors on that disk:

iostat -Ene | grep Hard | grep c1t11d0
c1t11d0  Soft Errors: 178 Hard Errors: 0 Transport Errors: 0

I'm not sure what I need to do, or how else I can determine whether the device
needs to be replaced.
Do I perform a zpool clear, do I need to replace c1t11d0, or do I rerun the scrub?



[zfs-discuss] checksum errors on Sun Fire X4500

2009-01-22 Thread Jay Anderson
I have b105 running on a Sun Fire X4500, and I am constantly seeing checksum 
errors reported by zpool status. The errors are showing up over time on every 
disk in the pool. In normal operation there might be errors on two or three 
disks each day, and sometimes there are enough errors that it reports "too many
errors" and the disk goes into a degraded state. I have had to remove the
spares from the pool, because otherwise the spares get pulled into the pool to
replace the drives. There are no reported hardware problems with any of the
drives. I have run scrub multiple times, and this also generates checksum
errors. After the scrub completes, the checksum errors continue to occur during
normal operation.

This problem also occurred with b103. Before that Solaris 10u4 was installed on 
the server, and it never had any checksum errors. With the OpenSolaris builds I 
am running CIFS Server, and that's the only difference in server function from 
when Solaris 10u4 was installed on it.

Is this a known issue? Any suggestions or workarounds?

Thank you.


Re: [zfs-discuss] checksum errors on Sun Fire X4500

2009-01-22 Thread Carsten Aulbert
Hi Jay,

Jay Anderson schrieb:
 I have b105 running on a Sun Fire X4500, and I am constantly seeing checksum 
 errors reported by zpool status. The errors are showing up over time on every 
 disk in the pool. In normal operation there might be errors on two or three 
 disks each day, and sometimes there are enough errors so it reports too many 
 errors, and the disk goes into a degraded state. I have had to remove the 
 spares from the pool because otherwise the spares get pulled into the pool to 
 replace the drives. There are no reported hardware problems with any of the 
 drives. I have run scrub multiple times, and this also generates checksum 
 errors. After the scrub completes, the checksum errors continue to occur during
 normal operation.
 
 This problem also occurred with b103. Before that Solaris 10u4 was installed 
 on the server, and it never had any checksum errors. With the OpenSolaris 
 builds I am running CIFS Server, and that's the only difference in server 
 function from when Solaris 10u4 was installed on it.
 
 Is this a known issue? Any suggestions or workarounds?

We had something similar: two or three disk slots started to act
weird and failed quite often, usually starting with a high error rate.
After we had exchanged two hard drives, the Sun hotline initiated an exchange
of the backplane; essentially the chassis was replaced.

Since then, we have not encountered anything like this anymore.

So it *might* be the backplane or a broken Marvell controller, but it's
hard to judge.

HTH

Carsten


[zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Thomas Nau
Dear all

As we wanted to patch one of our iSCSI Solaris servers, we had to offline
the ZFS submirrors on the clients connected to that server. The devices
connected to the second server stayed online, so the pools on the clients
were still available, but in degraded mode. When the server came back
up we onlined the devices on the clients, and the resilver completed pretty
quickly as the filesystems were read-mostly (FTP, HTTP server).

Nevertheless, during the first hour of operation after onlining we
noticed numerous checksum errors on the formerly offlined device. We
decided to scrub the pool, and after several hours we got about 3500 errors
in 600 GB of data.

I always thought that ZFS would sync the mirror immediately after bringing
the device online, without requiring a scrub. Am I wrong?
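For reference, the cycle we used was essentially the following (a simplified
sketch; the pool and device names here are placeholders, not our real ones):

# before patching the iSCSI server: take its half of the mirror offline
zpool offline tank c2t1d0
# ...patch and reboot the server, then bring the device back
zpool online tank c2t1d0
# the resilver starts automatically; we then scrubbed to be sure
zpool scrub tank
zpool status -v tank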

Both servers and clients run S10u5 with the latest patches, but we
saw the same behaviour with OpenSolaris clients.

Any hints?
Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED


Re: [zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Miles Nordin
 tn == Thomas Nau [EMAIL PROTECTED] writes:

tn Nevertheless during the first hour of operation after onlining
tn we recognized numerous checksum errors on the formerly
tn offlined device. We decided to scrub the pool and after
tn several hours we got about 3500 error in 600GB of data.

Did you use 'zpool offline' when you took them down, or did you
offline them some other way, like by breaking the network connection,
stopping the iSCSI target daemon, or 'iscsiadm remove
discovery-address ..' on the initiator?

This is my experience, too (but with old b71).  I'm also using iSCSI.
It might be a variant of this:

 http://bugs.opensolaris.org/view_bug.do?bug_id=6675685
 checksum errors after 'zfs offline ; reboot'

Aside from the fact that the checksum-errored blocks are silently not
redundant, it's also interesting because I think, in general, there
are a variety of things which can cause checksum errors besides
disk/cable/controller problems.  I wonder if they're useful for
diagnosing disk problems only in very gently-used setups, or not at
all?

Another iSCSI problem: for me, the targets I've 'zpool offline'd will
automatically ONLINE themselves when iSCSI rediscovers them. But only
sometimes. I haven't figured out how to predict when they will and
when they won't.




Re: [zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Thomas Nau
Miles

On Sat, 2 Aug 2008, Miles Nordin wrote:
 tn == Thomas Nau [EMAIL PROTECTED] writes:

tn Nevertheless during the first hour of operation after onlining
tn we recognized numerous checksum errors on the formerly
tn offlined device. We decided to scrub the pool and after
tn several hours we got about 3500 error in 600GB of data.

 Did you use 'zpool offline' when you took them down, or did you
 offline them some other way, like by breaking the network connection,
 stopping the iSCSI target daemon, or 'iscsiadm remove
 discovery-address ..' on the initiator?

We did a zpool offline, nothing else, before we took the iSCSI server 
down


 Another iSCSI problem: for me, the targets I've 'zpool offline'd will
 automatically ONLINE themselves when iSCSI rediscovers them.  but only
 sometimes.  I haven't figured out how to predict when they will and
 when they won't.

I never experienced that one, but we usually don't touch any of the iSCSI
settings as long as a device is offline. At least as long as we don't
have to for any reason.

Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED


Re: [zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Miles Nordin
 tn == Thomas Nau [EMAIL PROTECTED] writes:

tn I never experienced that one but we usually don't touch any of
tn the iSCSI settings as long as a devices is offline. At least
tn as long as we don't have to for any reason

Usually I do 'zpool offline' followed by 'iscsiadm remove
discovery-address ...'

This is for two reasons:

 1. At least with my old crappy Linux IET, it doesn't restore the
sessions unless I remove and add the discovery-address

 2. the auto-ONLINEing-on-discovery problem.  Removing the discovery
address makes absolutely sure ZFS doesn't ONLINE something before
I want it to.

If you have to do this maintenance again, you might want to try
removing the discovery address for reason #2. Maybe when your iSCSI
target was coming back up, it bounced a bit; in that case you would have
done the equivalent of removing the target without 'zpool offline'ing
first (and then immediately plugging it back in).

That's the ritual I've been using anyway.  If anything unexpected
happens, I still have to manually scrub the whole pool to seek out all
these hidden ``checksum'' errors.
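Spelled out, the ritual looks roughly like this (a sketch; the discovery
address and device name are placeholders):

# take the vdev down cleanly, then stop iSCSI from rediscovering it
zpool offline tank c3t1d0
iscsiadm remove discovery-address 192.168.1.10
# ...do the maintenance on the target, then reverse the steps
iscsiadm add discovery-address 192.168.1.10
zpool online tank c3t1d0
# and scrub afterwards to flush out any latent checksum errors
zpool scrub tank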

Hopefully some day you will be able to just look in fmdump and see
``yup, the target bounced once as it was coming back up.''  and
targets will be able to bounce as much as they like with
failmode=wait, or for short reasonable timeouts with other failmodes,
and automatically do fully-adequate but efficient resilvers with
proper dirty-region-logging without causing any latent checksum
errors.  and zpool offline'd devices will stay offline until reboot as
promised, and will never online themselves.  and iSCSI sessions will
always come up on their own without having to kick the initiator.




Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-23 Thread Jürgen Keil
I wrote:
 Bill Sommerfeld wrote:
  On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
I ran a scrub on a root pool after upgrading to snv_94, and got 
checksum errors:
   
   Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
   on a system that is running post snv_94 bits:  It also found checksum 
   errors
   
  once is accident.  twice is coincidence.  three times is enemy action :-)
  
  I'll file a bug as soon as I can 
 
 I filed 6727872, for the problem with zpool scrub checksum errors
 on unmounted zfs filesystems with an unplayed ZIL.

6727872 has already been fixed, in what will become snv_96.

For my zpool, zpool scrub doesn't report checksum errors any more.

But something is still a bit strange about the data reported by zpool status.
The error counts displayed by zpool status are all 0 (during the scrub, and when
the scrub has completed), but when the scrub completes it tells me that the
scrub completed after 0h58m with 6 errors. It doesn't list those errors, though.

# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: scrub in progress for 0h57m, 99.39% done, 0h0m to go
config:

NAME  STATE READ WRITE CKSUM
files ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c8t0d0s6  ONLINE   0 0 0
c9t0d0s6  ONLINE   0 0 0

errors: No known data errors


# zpool status -v files
  pool: files
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h58m with 6 errors on Wed Jul 23 18:23:00 2008
config:

NAME  STATE READ WRITE CKSUM
files ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c8t0d0s6  ONLINE   0 0 0
c9t0d0s6  ONLINE   0 0 0

errors: No known data errors
 
 


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-21 Thread Jürgen Keil
Bill Sommerfeld wrote:

 On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
   I ran a scrub on a root pool after upgrading to snv_94, and got checksum 
   errors:
  
  Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
  on a system that is running post snv_94 bits:  It also found checksum errors
  
 
 out of curiosity, is this a root pool?  

It started as a standard pool, and is using the version 3 zpool format.

I'm using a small ufs root, and have /usr as a zfs filesystem on
that pool.

At some point in the past I did set up a zfs root and /usr filesystem
for experimenting with xVM unstable bits.


 A second system of mine with a mirrored root pool (and an additional
 large multi-raidz pool) shows the same symptoms on the mirrored root
 pool only.
 
 once is accident.  twice is coincidence.  three times is enemy action :-)
 
 I'll file a bug as soon as I can (I'm travelling at the moment with
 spotty connectivity), citing my and your reports.

Btw. I also found the scrub checksum errors on a non-mirrored zpool
(laptop with only one hdd).

And on one zpool that was using a non-mirrored, striped pool on two
S-ATA drives.


I think that in my case the cause for the scrub checksum errors is an
open ZIL transaction on an *unmounted* zfs filesystem.  In the past
such a zfs state prevented creating snapshots for the unmounted zfs,
see bug 6482985, 6462803.  That is still the case.  But now it also
seems to trigger checksum errors for a zpool scrub.

Stack backtrace for the ECKSUM (which gets translated into EIO errors
in arc_read_done()):

  1  64703   arc_read_nolock:return, rval 5
  zfs`zil_read_log_block+0x140
  zfs`zil_parse+0x155
  zfs`traverse_zil+0x55
  zfs`scrub_visitbp+0x284
  zfs`scrub_visit_rootbp+0x4e
  zfs`scrub_visitds+0x82
  zfs`dsl_pool_scrub_sync+0x109
  zfs`dsl_pool_sync+0x158
  zfs`spa_sync+0x254
  zfs`txg_sync_thread+0x226
  unix`thread_start+0x8




Does a zdb -ivv {pool} report any ZIL headers with a claim_txg != 0
on your pools?  Is the dataset that is associated with such a ZIL an
unmounted zfs?

# zdb -ivv files | grep claim_txg
ZIL header: claim_txg 5164405, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 0, seq 0
ZIL header: claim_txg 5164405, seq 0
ZIL header: claim_txg 0, seq 0


# zdb -i files/matrix-usr
Dataset files/matrix-usr [ZPL], ID 216, cr_txg 5091978, 2.39G, 192089 objects

ZIL header: claim_txg 5164405, seq 0

first block: [L0 ZIL intent log] 1000L/1000P DVA[0]=0:12421e:1000 
zilog uncompressed LE contiguous birth=5163908 fill=0 
cksum=c368086f1485f7c4:39a549a81d769386:d8:3

Block seqno 3, already claimed, [L0 ZIL intent log] 1000L/1000P 
DVA[0]=0:12421e:1000 zilog uncompressed LE contiguous birth=5163908 
fill=0 cksum=c368086f1485f7c4:39a549a81d769386:d8:3


On two of my zpools I've eliminated the zpool scrub checksum errors by
mounting and then unmounting the zfs with the unplayed ZIL.
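Concretely, something like this (using the dataset from the zdb output above;
just a sketch):

# mounting replays the outstanding intent log; after the unmount a
# subsequent scrub no longer flags those blocks
zfs mount files/matrix-usr
zfs unmount files/matrix-usr
zpool scrub files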
 
 


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-21 Thread Jürgen Keil
Rustam wrote:
 
 I'm living with this error for almost 4 months and probably have record
 number of checksum errors:

 # zpool status -xv
   pool: box5
...
 errors: Permanent errors have been detected in the
 following files:
  
 box5:0x0

 I've Sol 10 U5 though.

I suspect that this (S10u5) is a different issue, because for my
system's pool it seems to be caused by the OpenSolaris putback
on July 7th for these fixes:

6343667 scrub/resilver has to start over when a snapshot is taken
6343693 'zpool status' gives delayed start for 'zpool scrub'
6670746 scrub on degraded pool return the status of 'resilver completed'?
6675685 DTL entries are lost resulting in checksum errors
6706404 get_history_one() can dereference off end of hist_event_table[]
6715414 assertion failed: ds-ds_owner != tag in dsl_dataset_rele()
6716437 ztest gets SEGV in arc_released()
6722838 bfu does not update grub
 
 


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-20 Thread Bill Sommerfeld
On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote:
  I ran a scrub on a root pool after upgrading to snv_94, and got checksum 
  errors:
 
 Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
 on a system that is running post snv_94 bits:  It also found checksum errors
 
 # zpool status files
   pool: files
  state: DEGRADED
 status: One or more devices has experienced an unrecoverable error.  An
   attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
   using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
 config:
 
   NAME          STATE     READ WRITE CKSUM
   files         DEGRADED     0     0    18
     mirror      DEGRADED     0     0    18
       c8t0d0s6  DEGRADED     0     0    36  too many errors
       c9t0d0s6  DEGRADED     0     0    36  too many errors
 
 errors: No known data errors

out of curiosity, is this a root pool?  

A second system of mine with a mirrored root pool (and an additional
large multi-raidz pool) shows the same symptoms on the mirrored root
pool only.

once is accident.  twice is coincidence.  three times is enemy
action :-)

I'll file a bug as soon as I can (I'm travelling at the moment with
spotty connectivity), citing my and your reports.

- Bill



Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-20 Thread dick hoogendijk
On Sun, 20 Jul 2008 11:26:16 -0700
Bill Sommerfeld [EMAIL PROTECTED] wrote:

 once is accident.  twice is coincidence.  three times is enemy
 action :-)

I have no access to b94 yet, but as things stand, it's probably better to
skip this one when it comes out, then.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
++ http://nagual.nl/ + SunOS sxce snv91 ++


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-18 Thread Jürgen Keil
 I ran a scrub on a root pool after upgrading to snv_94, and got checksum 
 errors:

Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
on a system that is running post snv_94 bits:  It also found checksum errors

# zpool status files
  pool: files
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
config:

NAME          STATE     READ WRITE CKSUM
files         DEGRADED     0     0    18
  mirror      DEGRADED     0     0    18
    c8t0d0s6  DEGRADED     0     0    36  too many errors
    c9t0d0s6  DEGRADED     0     0    36  too many errors

errors: No known data errors


Adding the -v option to zpool status returned:


errors: Permanent errors have been detected in the following files:

metadata:0x0



OTOH, trying to verify checksums with zdb -c didn't find any problems:

# zdb -cvv files

Traversing all blocks to verify checksums and verify nothing leaked ...

No leaks (block sum matches space maps exactly)

bp count: 2804880
bp logical:121461614592  avg:  43303
bp physical:   84585684992   avg:  30156compression:   1.44
bp allocated:  85146115584   avg:  30356compression:   1.43
SPA allocated: 85146115584  used: 79.30%

951.08u 419.55s 2:24:34.32 15.8%
#
 
 


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-18 Thread Jürgen Keil
  I ran a scrub on a root pool after upgrading to snv_94, and got checksum 
  errors:
 
 Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
 on a system that is running post snv_94 bits:  It also found checksum errors
...
 OTOH, trying to verify checksums with zdb -c didn't
 find any problems:

And  a zpool scrub under snv_85 doesn't find checksum errors, either.
 
 


Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-18 Thread Rustam Aliyev
I've been living with this error for almost 4 months and probably have a record
number of checksum errors:


core# zpool status -xv
 pool: box5
state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
  see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:

   NAME        STATE     READ WRITE CKSUM
   box5        ONLINE       0     0   856
     mirror    ONLINE       0     0   428
       c1d0    ONLINE       0     0   856
       c2d0    ONLINE       0     0   856
     mirror    ONLINE       0     0   428
       c2d1    ONLINE       0     0   856
       c1d1    ONLINE       0     0   856

errors: Permanent errors have been detected in the following files:

   box5:0x0


I have Sol 10 U5, though.

--
Rustam.


Jürgen Keil wrote:

I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors:



Hmm, after reading this, I started a zpool scrub on my mirrored pool, 
on a system that is running post snv_94 bits:  It also found checksum errors


# zpool status files
  pool: files
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008
config:

NAME          STATE     READ WRITE CKSUM
files         DEGRADED     0     0    18
  mirror      DEGRADED     0     0    18
    c8t0d0s6  DEGRADED     0     0    36  too many errors
    c9t0d0s6  DEGRADED     0     0    36  too many errors

errors: No known data errors


Adding the -v option to zpool status returned:


errors: Permanent errors have been detected in the following files:

metadata:0x0



OTOH, trying to verify checksums with zdb -c didn't find any problems:

# zdb -cvv files

Traversing all blocks to verify checksums and verify nothing leaked ...

No leaks (block sum matches space maps exactly)

bp count: 2804880
bp logical:121461614592  avg:  43303
bp physical:   84585684992   avg:  30156compression:   1.44
bp allocated:  85146115584   avg:  30356compression:   1.43
SPA allocated: 85146115584  used: 79.30%

951.08u 419.55s 2:24:34.32 15.8%
#
 
 

  




[zfs-discuss] checksum errors on root pool after upgrade to snv_94

2008-07-17 Thread Bill Sommerfeld
I ran a scrub on a root pool after upgrading to snv_94, and got checksum
errors:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14
2008
config:

NAME  STATE READ WRITE CKSUM
r00t  ONLINE   0 0 2
  mirror  ONLINE   0 0 2
c4t0d0s0  ONLINE   0 0 4
c4t1d0s0  ONLINE   0 0 4

I ran it again, and it's now reporting the same errors, but still says
applications are unaffected:

  pool: r00t
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008
config:

NAME  STATE READ WRITE CKSUM
r00t  ONLINE   0 0 4
  mirror  ONLINE   0 0 4
c4t0d0s0  ONLINE   0 0 8
c4t1d0s0  ONLINE   0 0 8

errors: No known data errors


I wonder if I'm running into some combination of:

6725341 Running 'zpool scrub' repeatedly on a pool show an ever
increasing error count

and maybe:

6437568 ditto block repair is incorrectly propagated to root vdev

Any way to dig further to determine what's going on?
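A couple of things that might be worth trying (hedged suggestions, not a
definitive recipe):

# count and inspect the underlying checksum ereports
fmdump -eV | grep -c ereport.fs.zfs.checksum
fmdump -eV | more
# independently walk and verify checksums with zdb, as a cross-check
zdb -cvv r00t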

- Bill



Re: [zfs-discuss] Checksum errors in storage pool

2007-03-15 Thread Hans-Juergen Schnitzer


In the meantime, the Sun support engineer figured out that zdb does not work
here because zdb uses the information from /etc/zfs/zpool.cache. However,
I used zpool -R to import the pool, which did not update
/etc/zfs/zpool.cache. Is there another method to map a dataset
number to a filesystem?
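For the archives, the general mapping approach looks roughly like this (a
sketch only; it assumes zdb can see the pool, i.e. it is in
/etc/zfs/zpool.cache, which is exactly what the -R import breaks here, and the
exact zdb options vary by release):

# the DATASET column in 'zpool status -xv' is hex; convert it to decimal
printf '%d\n' 0x3b31          # -> 15153
# find which filesystem carries that dataset ID
zdb -d <pool> | grep 'ID 15153'
# then dump the damaged object (0x15d -> 349) to get its path
zdb -dddd <pool>/<filesystem> 349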

Hans Schnitzer

H.-J. Schnitzer wrote:

Hi,

I am using ZFS under Solaris 10u3.

After the defect of a 3510 Raid controller, I have several storage pools
with defect objects. zpool status -xv prints a long list:

  DATASET  OBJECT  RANGE
  4c0c 5dd lvl=0 blkid=2
  28   b346lvl=0 blkid=9
  3b31 15d lvl=0 blkid=1
  3b31 15d lvl=0 blkid=2
  3b31 15d lvl=0 blkid=2727
  3b31 190 lvl=0 blkid=0
  ...

I know that the number in the column OBJECT identifies the inode number
of the affected file. 
However, I have more than 1000 filesystems  in each of the 
affected storage pools. So how do I identify the correct filesystem?

According to 
http://blogs.sun.com/erickustarz/entry/damaged_files_and_zpool_status
I have to use zdb. But I can't figure out how to use it. Can you help?

Hans Schnitzer
 
 







[zfs-discuss] Checksum errors in storage pool

2007-03-07 Thread H.-J. Schnitzer
Hi,

I am using ZFS under Solaris 10u3.

After the defect of a 3510 Raid controller, I have several storage pools
with defect objects. zpool status -xv prints a long list:

  DATASET  OBJECT  RANGE
  4c0c 5dd lvl=0 blkid=2
  28   b346lvl=0 blkid=9
  3b31 15d lvl=0 blkid=1
  3b31 15d lvl=0 blkid=2
  3b31 15d lvl=0 blkid=2727
  3b31 190 lvl=0 blkid=0
  ...

I know that the number in the column OBJECT identifies the inode number
of the affected file. 
However, I have more than 1000 filesystems  in each of the 
affected storage pools. So how do I identify the correct filesystem?
According to 
http://blogs.sun.com/erickustarz/entry/damaged_files_and_zpool_status
I have to use zdb. But I can't figure out how to use it. Can you help?

Hans Schnitzer
 
 


Re: [zfs-discuss] Checksum errors...

2007-01-04 Thread eric kustarz



errors: The following persistent errors have been detected:

  DATASET                      OBJECT  RANGE
  z_tsmsun1_pool/tsmsrv1_pool  2620    8464760832-8464891904

Looks like I have possibly a single file that is corrupted.  My question is how do I find the file.  Is it as simple as doing a find command using -inum 2620?  



FYI, I'm finishing up:
6410433 'zpool status -v' would be more useful with filenames

which will give you the complete path to the file (if applicable), so
you don't have to do a 'find' on the inum.


eric


[zfs-discuss] Checksum errors...

2006-12-28 Thread John
Background:
Large ZFS pool built on a couple of Sun 3511 SATA arrays. RAID-5 is done in the 
3511s. ZFS is non-redundant. We have been using this setup for a couple of 
months now with no issues.

Problem:
Yesterday afternoon we started getting checksum errors.  There have been no 
hardware errors reported at either the Solaris level or the hardware level.  
3511 logs are clean. Here is the zpool status:

tsmsun1 - /home/root zpool status -xv
  pool: z_tsmsun1_pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
z_tsmsun1_pool  ONLINE   0 0   180
  c22t600C0FF000678A0A86F3D901d0s0  ONLINE   0 0 0
  c22t600C0FF000678A0A86F3D900d0s0  ONLINE   0 0 0
  c22t600C0FF00068190A86F3D901d0s0  ONLINE   0 0 0
  c22t600C0FF00068190A86F3D900d0s0  ONLINE   0 0 0
  c22t600C0FF00068191A598ED500d0s0  ONLINE   0 0 0
  c22t600C0FF000678A1A598ED500d0s0  ONLINE   0 0 0
  c22t600C0FF00068191A598ED501d0s0  ONLINE   0 0 0
  c22t600C0FF000681943A7223100d0s0  ONLINE   0 0 0
  c22t600C0FF000681943A7223101d0ONLINE   0 0 0
  c22t600C0FF000681932BBD24400d0s0  ONLINE   0 0 0
  c22t600C0FF000681932BBD24401d0s0  ONLINE   0 0 0
  c22t600C0FF000678A43A7223100d0s0  ONLINE   0 0   180
  c22t600C0FF000678A2055211B01d0s0  ONLINE   0 0 0
  c22t600C0FF000678A2055211B00d0s0  ONLINE   0 0 0
  c22t600C0FF000678A32BBD24401d0s0  ONLINE   0 0 0
  c22t600C0FF000678A1A598ED501d0s0  ONLINE   0 0 0
  c22t600C0FF000678A32BBD24400d0s0  ONLINE   0 0 0
  c22t600C0FF000678A43A7223101d0s0  ONLINE   0 0 0
  c22t600C0FF00068192055211B00d0s0  ONLINE   0 0 0
  c22t600C0FF00068192055211B01d0s0  ONLINE   0 0 0
  c22t600C0FF000678A44F3D81B00d0s0  ONLINE   0 0 0
  c22t600C0FF000678A44F3D81B01d0s0  ONLINE   0 0 0
  c22t600C0FF000681944F3D81B00d0s0  ONLINE   0 0 0
  c22t600C0FF000681944F3D81B01d0s0  ONLINE   0 0 0

errors: The following persistent errors have been detected:

  DATASET                      OBJECT  RANGE
  z_tsmsun1_pool/tsmsrv1_pool  2620    8464760832-8464891904

Looks like I possibly have a single corrupted file. My question is: how
do I find the file? Is it as simple as doing a find command using -inum
2620?
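Here is a sketch of that lookup (the mount point below is a guess at the
default, so adjust it to the filesystem's actual mountpoint; on some builds the
OBJECT column is printed in hex, in which case the decimal equivalent is
needed):

# search only within the affected filesystem for the object/inode number
find /z_tsmsun1_pool/tsmsrv1_pool -mount -inum 2620 -print
# if the OBJECT column is actually hex, use the decimal value instead
find /z_tsmsun1_pool/tsmsrv1_pool -mount -inum 9760 -print   # 9760 == 0x2620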

TIA,
john
 
 