Interpreting 3Ware error messages

2010-05-18 Thread Doug Poland
Hello,

I have a 7.2-R i386 system running a 3ware 9500S-4LP SATA 150
controller with 4 SATA drives.  I recently started seeing the
following in my logs:

smartd[906]: Device: /dev/twa0 [3ware_disk_00], 1 Currently unreadable (pending) sectors
smartd[906]: Device: /dev/twa0 [3ware_disk_00], 1 Offline uncorrectable sectors

Using the tw_cli program, I can examine the disk subsystem, but I do
not see any issues with an underlying drive:

Unit      UnitType  Status  %RCmpl  %V/I/M  Port  Stripe  Size(GB)
-------------------------------------------------------------------
u0        RAID-10   OK      -       -       -     64K     298.002
u0-0      RAID-1    OK      -       -       -     -       -
u0-0-0    DISK      OK      -       -       p2    -       149.001
u0-0-1    DISK      OK      -       -       p3    -       149.001
u0-1      RAID-1    OK      -       -       -     -       -
u0-1-0    DISK      OK      -       -       p0    -       149.001
u0-1-1    DISK      OK      -       -       p1    -       149.001


I suspect a disk problem, but cannot identify the individual disk or
the nature of the problem.  Can anyone shed some light on this?
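To narrow down which drive smartd is complaining about, the log line can be cross-referenced with the tw_cli listing. A minimal Python sketch, assuming (this is an assumption, not something established in the thread) that the NN in smartd's `[3ware_disk_NN]` tag is the controller port number, i.e. the same N you would pass to `smartctl -d 3ware,N /dev/twa0`. The helper names `smartd_reports` and `port_to_unit` are hypothetical:

```python
import re

# smartd messages from the post, one line per message (assumed unwrapped).
LOG = """\
smartd[906]: Device: /dev/twa0 [3ware_disk_00], 1 Currently unreadable (pending) sectors
smartd[906]: Device: /dev/twa0 [3ware_disk_00], 1 Offline uncorrectable sectors
"""

# tw_cli unit listing from the post (data rows only).
UNITS = """\
u0        RAID-10   OK      -       -       -     64K     298.002
u0-0      RAID-1    OK      -       -       -     -       -
u0-0-0    DISK      OK      -       -       p2    -       149.001
u0-0-1    DISK      OK      -       -       p3    -       149.001
u0-1      RAID-1    OK      -       -       -     -       -
u0-1-0    DISK      OK      -       -       p0    -       149.001
u0-1-1    DISK      OK      -       -       p1    -       149.001
"""

SMARTD_RE = re.compile(
    r"Device: (?P<dev>\S+) \[3ware_disk_(?P<port>\d+)\], (?P<count>\d+) (?P<what>.+)"
)


def smartd_reports(log):
    """Yield (device, port, count, message) for each smartd 3ware line."""
    for line in log.splitlines():
        m = SMARTD_RE.search(line)
        if m:
            yield m["dev"], int(m["port"]), int(m["count"]), m["what"]


def port_to_unit(listing):
    """Map controller port number -> tw_cli unit path, using DISK rows only."""
    mapping = {}
    for line in listing.splitlines():
        fields = line.split()
        # Columns: Unit UnitType Status %RCmpl %V/I/M Port Stripe Size(GB)
        if len(fields) == 8 and fields[1] == "DISK" and fields[5].startswith("p"):
            mapping[int(fields[5][1:])] = fields[0]
    return mapping


units = port_to_unit(UNITS)
for dev, port, count, what in smartd_reports(LOG):
    print(f"{dev} port {port} ({units.get(port, '?')}): {count} {what}")
```

Under that assumption, both messages point at port 0, which the listing shows as u0-1-0 (one half of the second RAID-1 mirror); reading that drive's SMART data directly would confirm or refute it.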


-- 
Regards,
Doug

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: Interpreting 3Ware error messages

2010-05-18 Thread Matthew Seaman

On 18/05/2010 15:43:25, Doug Poland wrote:
 I suspect a disk problem, but cannot identify the individual disk or
 the nature of the problem.  Can anyone shed some light on this?
 
 

Look at the SMART data for the disk(s) -- my guess is that you're seeing
sectors failing and being re-mapped by the drive firmware.  If this is
happening to any significant extent, the disk may well be reaching the
end of its usable life; happily, you seem to have been alerted to that
in time to do something about it without needing to run around in a
blind panic.
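A rough Python sketch of that check: run something like `smartctl -A -d 3ware,N /dev/twa0` for each port N, then pick out the remap-related attributes (5 Reallocated_Sector_Ct, 196 Reallocated_Event_Count, 197 Current_Pending_Sector, 198 Offline_Uncorrectable). The sample below only imitates smartctl's attribute-table layout; its numbers are invented for illustration, apart from the raw 197/198 counts, which mirror the smartd log. `remap_attributes` is a hypothetical helper:

```python
# Invented sample in the layout of `smartctl -A` output (not real drive data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""

# Remap-related attribute IDs worth watching.
WATCH = {5, 196, 197, 198}


def remap_attributes(text):
    """Return {id: (name, value, thresh, raw)} for the watched attribute IDs."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header and blank lines
        attr_id = int(fields[0])
        if attr_id in WATCH:
            # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
            # fields[9] is the first token of RAW_VALUE (a plain count for these IDs).
            result[attr_id] = (fields[1], int(fields[3]), int(fields[5]), int(fields[9]))
    return result


for attr_id, (name, value, thresh, raw) in sorted(remap_attributes(SAMPLE).items()):
    print(f"{attr_id:3d} {name:24s} value={value} thresh={thresh} raw={raw}")
```

A non-zero raw count on 197 or 198 for one port, and zero on the others, would single out the drive.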

There's a background task you can set up on 3ware controllers that will
read every sector of a disk specifically to bring problems like this to
light; otherwise they could go unnoticed for a long time and lead to
silent data corruption.
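The idea behind such a task can be sketched in Python as a read-only pass over every block that records failing offsets instead of stopping. This illustrates the principle only; it is not the 3ware feature. `surface_scan` is a hypothetical helper, and it assumes `lseek(..., SEEK_END)` reports the size of `path`, which holds for a regular file or disk image (for a raw device you may need to get the size another way, e.g. FreeBSD's diskinfo(8)):

```python
import os


def surface_scan(path, block_size=64 * 1024):
    """Read every block of `path`; return (blocks_read, bad_offsets).

    A failed read is recorded and skipped, so one bad spot does not stop
    the scan. Nothing is written: this only reports unreadable blocks.
    """
    bad = []
    blocks = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        for offset in range(0, size, block_size):
            try:
                os.pread(fd, block_size, offset)  # positional read, no seek needed
            except OSError:
                bad.append(offset)
            blocks += 1
    finally:
        os.close(fd)
    return blocks, bad
```

Calling `surface_scan("/path/to/image")` returns the number of blocks read and a list of byte offsets whose reads failed.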

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
7 Priory Courtyard, Flat 3
Ramsgate, Kent, CT11 9PW
PGP: http://www.infracaninophile.co.uk/pgpkey


Re: Interpreting 3Ware error messages

2010-05-18 Thread Doug Poland

On Tue, May 18, 2010 09:55, Matthew Seaman wrote:

 There's a background task you can set up on 3ware controllers that
 will attempt to access all sectors of a disk specifically to bring to
 light problems like this, which otherwise could go unnoticed for a
 long time and lead to silent data corruption.

Will do, thanks for the info.


-- 
Regards,
Doug



Re: Interpreting 3Ware error messages

2010-05-18 Thread Michael Powell
Matthew Seaman wrote:

 On 18/05/2010 15:43:25, Doug Poland wrote:
 smartd[906]: Device: /dev/twa0 [3ware_disk_00], 1 Currently unreadable (pending) sectors
 smartd[906]: Device: /dev/twa0 [3ware_disk_00], 1 Offline uncorrectable sectors

I think the "Offline uncorrectable" error usually indicates that there are
sectors pending remap, which will not get remapped or marked out until the
next write to them occurs. On unused space these can easily be cleared by
writing over them with dd; however, you don't want to be messing with this
around active data.
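The dd approach comes down to overwriting exactly the failing sector, which works because dd's `seek=` counts in units of `bs=`: with `bs` equal to the sector size, `seek=LBA` lands on that LBA. A minimal Python sketch that only builds (never runs) the command line, assuming you already know the failing LBA (smartctl's self-test log reports an LBA_of_first_error) and the drive's logical sector size. `/dev/DISK` and the LBA are placeholders, and `dd_overwrite_argv` is a hypothetical helper:

```python
def dd_overwrite_argv(device, lba, sector_size=512, count=1):
    """Build (but do not run) a dd command that zeroes `count` sectors at `lba`.

    seek= is measured in blocks of bs= bytes, so bs=sector_size makes
    seek=lba address that exact sector. Overwriting destroys whatever is
    stored there: only use it on sectors known to hold no live data.
    """
    if lba < 0 or count < 1:
        raise ValueError("lba must be >= 0 and count >= 1")
    return [
        "dd",
        "if=/dev/zero",
        f"of={device}",
        f"bs={sector_size}",
        f"seek={lba}",
        f"count={count}",
    ]


# Placeholder device and LBA, for illustration only.
print(" ".join(dd_overwrite_argv("/dev/DISK", 123456)))
```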
 
 Look at the SMART data for the disk(s) -- my guess is that you're seeing
 sectors failing and being re-mapped by the drive firmware.  If this is
 happening to any significant extent the disk may well be reaching the
 end of its usable life: happily you would seem to have been alerted to
 that in time to do something about it without needing to run around in a
 blind panic.

If the remap area is not yet full, these sectors should still get remapped
on the next write. If it is full, replace the drive.
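That rule of thumb can be written as a small triage function. A hedged Python sketch; the key assumption (also noted in the code) is that a Reallocated_Sector_Ct normalized VALUE at or below its THRESH means the spare area is used up. SMART normalization is vendor-specific, so treat this as a heuristic. `remap_advice` is hypothetical:

```python
def remap_advice(pending, realloc_value, realloc_thresh):
    """Rough triage following the advice above.

    pending:        raw Current_Pending_Sector count (attribute 197)
    realloc_value:  normalized VALUE of Reallocated_Sector_Ct (attribute 5)
    realloc_thresh: THRESH of Reallocated_Sector_Ct

    ASSUMPTION: VALUE <= THRESH is taken to mean the spare (remap) area
    is exhausted; vendors normalize this attribute differently.
    """
    if realloc_value <= realloc_thresh:
        return "replace drive: spare area appears exhausted"
    if pending > 0:
        return "rewrite/scrub: pending sectors should remap on next write"
    return "ok"
```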
 
 There's a background task you can set up on 3ware controllers that will
 attempt to access all sectors of a disk specifically to bring to light
 problems like this, which otherwise could go unnoticed for a long time
 and lead to silent data corruption.

Many controllers refer to this as 'disk scrub' or 'disk verify'. If the
remap zone still has space available, a scrub should get the affected
sectors remapped and clear this counter.
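Why a scrub clears the counter on a mirrored array can be shown with a toy model in Python: a sector that cannot be read on one disk is rewritten from the other half of the mirror, and in this model the write is what makes the drive remap it. `Disk`, `try_read` and `scrub_mirror` are hypothetical and only handle read errors; a real controller verify may also compare the two copies:

```python
class Disk:
    """Toy disk (hypothetical model): a pending sector cannot be read until
    it is rewritten, at which point the firmware remaps it."""

    def __init__(self, nsectors, pending=()):
        self.data = [0] * nsectors
        self.pending = set(pending)

    def read(self, lba):
        if lba in self.pending:
            raise OSError(f"unreadable sector {lba}")
        return self.data[lba]

    def write(self, lba, value):
        self.pending.discard(lba)  # the write clears the pending state (remap)
        self.data[lba] = value


def try_read(disk, lba):
    """Return the sector contents, or None if the read fails."""
    try:
        return disk.read(lba)
    except OSError:
        return None


def scrub_mirror(a, b):
    """Read every sector of a RAID-1 pair; where one side fails, rewrite it
    from the other. Returns the repaired LBAs. Sectors unreadable on both
    sides are left alone (that data is already lost)."""
    repaired = []
    for lba in range(len(a.data)):
        va, vb = try_read(a, lba), try_read(b, lba)
        if va is None and vb is not None:
            a.write(lba, vb)
            repaired.append(lba)
        elif vb is None and va is not None:
            b.write(lba, va)
            repaired.append(lba)
    return repaired
```

For example, with `a = Disk(8, pending={3})` and `b = Disk(8)`, `scrub_mirror(a, b)` repairs LBA 3 and leaves `a.pending` empty.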

Periodic scrubbing can find and fix 'silent data corruption': data
sectors that have failed between the last write and the next read. When
such failures are spread across multiple drives, you won't know about them
until a drive goes bad, you pull and replace it, and then find the array
will not rebuild. I scrub my arrays every Friday night.
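The rebuild failure described here can be illustrated with a few lines of Python (a toy, hypothetical model): each disk of a RAID-1 pair has a latent bad sector at a different LBA, so normal reads never fail, but the rebuild after one disk dies must read the survivor in full:

```python
def array_looks_healthy(a_bad, b_bad):
    """RAID-1 reads succeed if either disk can serve a sector, so latent bad
    sectors stay invisible unless both disks fail at the same LBA."""
    return not (set(a_bad) & set(b_bad))


def rebuild_after_failure(survivor_bad, scrubbed):
    """A rebuild copies every sector from the surviving disk. If a scrub ran
    while both disks were alive (and the partner could read those sectors),
    the survivor's latent errors were already rewritten."""
    remaining = set() if scrubbed else set(survivor_bad)
    if not remaining:
        return "rebuild ok"
    return f"rebuild fails at LBAs {sorted(remaining)}"


# Disk A has a latent bad sector at LBA 10, disk B at LBA 20.
a_bad, b_bad = {10}, {20}
print(array_looks_healthy(a_bad, b_bad))  # True: nothing looks wrong
# Disk B dies, so the rebuild has to read all of disk A.
print(rebuild_after_failure(a_bad, scrubbed=False))
print(rebuild_after_failure(a_bad, scrubbed=True))
```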

-Mike



