Re: [zfs-discuss] ZIL errors but device seems OK
Hi,

After a little bit more digging I found this in /var/adm/messages:

Mar 25 13:13:08 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:13:08 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:13:08 brszfs02        Error for command 'write sector'    Error Level: Informational
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:13:43 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:13:43 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:43 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:13:43 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:13:43 brszfs02        Error for command 'read sector'    Error Level: Informational
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:13:43 brszfs02        Error for command 'read sector'    Error Level: Informational
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:13:43 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:14:18 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:18 brszfs02        timeout: early timeout, target=1 lun=0
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1/c...@1,0 (Disk1):
Mar 25 13:14:18 brszfs02        Error for command 'read sector'    Error Level: Informational
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.notice]   Sense Key: aborted command
Mar 25 13:14:18 brszfs02 gda: [ID 107833 kern.notice]   Vendor 'Gen-ATA ' error code: 0x3
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: abort request, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: abort device, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: reset target, target=0 lun=0
Mar 25 13:14:33 brszfs02 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci-...@1f,2/i...@1 (ata1):
Mar 25 13:14:33 brszfs02        timeout: reset bus, target=0 lun=0
Mar 25 13:14:34 brszfs02 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Mar 25 13:14:34 brszfs02 EVENT-TIME: Thu Mar 25 13:14:34 GMT 2010
Mar 25 13:14:34 brszfs02 PLATFORM: HP-Compaq-dc7700-Convertible-Minitower, CSN: CZC7264JN4, HOSTNAME: brszfs02
Mar 25 13:14:34 brszfs02 SOURCE: zfs-diagnosis, REV: 1.0
Mar 25 13:14:34 brszfs02 EVENT-ID: 6c0bd163-56bf-ee92-e393-ce2063355b52
Mar 25 13:14:34 brszfs02 DESC: The number of I/O errors associated with a ZFS device exceeded
Mar 25 13:14:34 brszfs02 acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Mar 25 13:14:34 brszfs02 AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Mar 25 13:14:34 brszfs02 will be made to activate a hot spare if available.
Mar 25 13:14:34 brszfs02 IMPACT: Fault tolerance of the pool may be compromised.
Mar 25 13:14:34 brszfs02 REC-ACTION: Run 'zpool status -x' and replace the bad device.

If I remember correctly, I was thrashing this pool with Bonnie++ at the time.

Cheers,
Richard
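(For reference, recurring events like the ones above can be tallied with a quick filter over /var/adm/messages. This is an editor's sketch, not from the thread; the sample file below embeds a few of the log lines for illustration, and in practice you would point the grep at /var/adm/messages itself.)

```shell
# Sketch: count ata timeout and aborted-command events in a messages file.
# Sample lines embedded here so the snippet is self-contained.
cat > /tmp/messages.sample <<'EOF'
Mar 25 13:13:08 brszfs02 timeout: early timeout, target=1 lun=0
Mar 25 13:13:08 brszfs02 gda: [ID 107833 kern.notice] Sense Key: aborted command
Mar 25 13:14:33 brszfs02 timeout: reset bus, target=0 lun=0
EOF
# -c prints the number of matching lines; the alternation catches both
# the driver timeouts and the aborted-command sense keys.
grep -cE 'timeout:|aborted command' /tmp/messages.sample
```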
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL errors but device seems OK
comment below...

On Apr 14, 2010, at 1:49 AM, Richard Skelton wrote:

> Hi,
> I have installed OpenSolaris snv_134 from the iso at genunix.org.
> Mon Mar 8 2010 New OpenSolaris preview, based on build 134
> I created a zpool:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           c7t4d0    ONLINE       0     0     0
>           c7t5d0    ONLINE       0     0     0
>           c7t6d0    ONLINE       0     0     0
>           c7t8d0    ONLINE       0     0     0
>           c7t9d0    ONLINE       0     0     0
>         logs
>           c5d1p1    ONLINE       0     0     0
>         cache
>           c5d1p2    ONLINE       0     0     0
>
> The log device and cache are each one half of a 128GB OCZ VERTEX-TURBO flash
> card.
>
> I am getting good NFS performance but have seen this error:
>
> r...@brszfs02:~# zpool status tank
>   pool: tank
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        DEGRADED     0     0     0
>           c7t4d0    ONLINE       0     0     0
>           c7t5d0    ONLINE       0     0     0
>           c7t6d0    ONLINE       0     0     0
>           c7t8d0    ONLINE       0     0     0
>           c7t9d0    ONLINE       0     0     0
>         logs
>           c5d1p1    FAULTED      0     4     0  too many errors
>         cache
>           c5d1p2    ONLINE       0     0     0
>
> errors: No known data errors
>
> r...@brszfs02:~# fmadm faulty
> ---------------  ------------------------------------  -------------  --------
> TIME             EVENT-ID                              MSG-ID         SEVERITY
> ---------------  ------------------------------------  -------------  --------
> Mar 25 13:14:34  6c0bd163-56bf-ee92-e393-ce2063355b52  ZFS-8000-FD    Major
>
> Host        : brszfs02
> Platform    : HP-Compaq-dc7700-Convertible-Minitower   Chassis_id : CZC7264JN4
> Product_sn  :
>
> Fault class : fault.fs.zfs.vdev.io
> Affects     : zfs://pool=tank/vdev=4ec464b5bf74a898
>                   faulted but still in service
> Problem in  : zfs://pool=tank/vdev=4ec464b5bf74a898
>                   faulted but still in service
>
> Description : The number of I/O errors associated with a ZFS device exceeded
>               acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD
>               for more information.
>
> Response    : The device has been offlined and marked as faulted.  An attempt
>               will be made to activate a hot spare if available.
>
> Impact      : Fault tolerance of the pool may be compromised.
>
> Action      : Run 'zpool status -x' and replace the bad device.
>
> r...@brszfs02:~# iostat -En c5d1
> c5d1  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
> Model: OCZ VERTEX-TURB Revision: Serial No: 062F97G71C5T676 Size: 128.04GB
> <128035160064 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
> Illegal Request: 0
>
> As there seem to be no hardware errors reported by iostat, I ran 'zpool clear
> tank' and a scrub on Monday.
> Up to now I have seen no new errors; I have set up a cron job to scrub at
> 01:30 each day.
>
> Is the flash card faulty, or is this a ZFS problem?

In my testing of Flash-based SSDs, this is the most common error. Since the
drive is not reporting media errors or hard errors, the only interim
conclusion is that something in the data path caused data to be corrupted.
This can mean the drive doesn't report these errors, the errors are
transient, or an error occurred which is not related to the data (eg. phantom
writes). For example, my current bad boy says:

$ iostat -En
...
c7t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: USB2.0 Product: VAULT DRIVE Revision: 1100 Serial No:
Size: 8.12GB <8120172544 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 103 Predictive Failure Analysis: 0
...

$ pfexec zpool status -v syspool
  pool: syspool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h1m with 325 errors on Wed Apr 14 11:06:58 2010
config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0   330
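(Editor's aside: the per-device counters in `iostat -En` output can be totalled mechanically when comparing devices. A sketch follows; the sample block embeds the counters from the USB-drive output above so the snippet is self-contained, but in practice you would capture `iostat -En` output to a file first.)

```shell
# Sketch: sum the numeric error counters in iostat -En style output.
# Sample embedded; normally: iostat -En c7t0d0 > /tmp/iostat.out
cat > /tmp/iostat.out <<'EOF'
c7t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 103 Predictive Failure Analysis: 0
EOF
# Every counter follows a colon, so split on ':' and coerce the leading
# digits of each remaining field to a number (awk's string-to-number rule).
awk -F':' '{ for (i = 2; i <= NF; i++) total += $i + 0 }
           END { print "total error count:", total }' /tmp/iostat.out
```

A non-zero total here (103 Illegal Requests in the sample) is exactly the kind of soft evidence that distinguishes a flaky data path from a device that reports nothing at all.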
[zfs-discuss] ZIL errors but device seems OK
Hi,
I have installed OpenSolaris snv_134 from the iso at genunix.org.
Mon Mar 8 2010 New OpenSolaris preview, based on build 134
I created a zpool:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          c7t4d0    ONLINE       0     0     0
          c7t5d0    ONLINE       0     0     0
          c7t6d0    ONLINE       0     0     0
          c7t8d0    ONLINE       0     0     0
          c7t9d0    ONLINE       0     0     0
        logs
          c5d1p1    ONLINE       0     0     0
        cache
          c5d1p2    ONLINE       0     0     0

The log device and cache are each one half of a 128GB OCZ VERTEX-TURBO flash
card.

I am getting good NFS performance but have seen this error:

r...@brszfs02:~# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          c7t4d0    ONLINE       0     0     0
          c7t5d0    ONLINE       0     0     0
          c7t6d0    ONLINE       0     0     0
          c7t8d0    ONLINE       0     0     0
          c7t9d0    ONLINE       0     0     0
        logs
          c5d1p1    FAULTED      0     4     0  too many errors
        cache
          c5d1p2    ONLINE       0     0     0

errors: No known data errors

r...@brszfs02:~# fmadm faulty
---------------  ------------------------------------  -------------  --------
TIME             EVENT-ID                              MSG-ID         SEVERITY
---------------  ------------------------------------  -------------  --------
Mar 25 13:14:34  6c0bd163-56bf-ee92-e393-ce2063355b52  ZFS-8000-FD    Major

Host        : brszfs02
Platform    : HP-Compaq-dc7700-Convertible-Minitower   Chassis_id : CZC7264JN4
Product_sn  :

Fault class : fault.fs.zfs.vdev.io
Affects     : zfs://pool=tank/vdev=4ec464b5bf74a898
                  faulted but still in service
Problem in  : zfs://pool=tank/vdev=4ec464b5bf74a898
                  faulted but still in service

Description : The number of I/O errors associated with a ZFS device exceeded
              acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD
              for more information.

Response    : The device has been offlined and marked as faulted.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

r...@brszfs02:~# iostat -En c5d1
c5d1  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: OCZ VERTEX-TURB Revision: Serial No: 062F97G71C5T676 Size: 128.04GB
<128035160064 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0

As there seem to be no hardware errors reported by iostat, I ran 'zpool clear
tank' and a scrub on Monday.
Up to now I have seen no new errors; I have set up a cron job to scrub at
01:30 each day.

Is the flash card faulty, or is this a ZFS problem?

Cheers,
Richard
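(Editor's note: the daily 01:30 scrub described above would typically be a root crontab entry along these lines. This is a sketch under assumptions: the pool name `tank` is from the thread, and the stock OpenSolaris path for zpool is assumed; install with `crontab -e` as root.)

```shell
# Hypothetical crontab entry: scrub the 'tank' pool at 01:30 every day,
# matching the schedule described in the message above.
# fields: minute hour day-of-month month day-of-week command
30 1 * * * /usr/sbin/zpool scrub tank
```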