RE: Disk Errors

2005-02-02 Thread Salyzyn, Mark
From: Douglas Gilbert [mailto:[EMAIL PROTECTED] writes:
 All may not be lost. If a medium error occurs and the ASC and
 ASCQ imply the sector could be read but
 failed ECC then the READ LONG SCSI command should fetch the
 block (plus ECC and other data). For example a Fujitsu MAM3184
 returns 576 bytes. It is probably too much to expect that all
 the damage will be in the last 64 bytes.

However, the drive has already taken whatever action it could to
reconstruct the data; its failure to return the block for a standard read
means that the data is in fact `lost'. The data+ECC combination must be
in a state where there are more bits of damage than the error correction
can deal with; 64 bytes of ECC deals with single bit errors, thus we know
that we have more than 1 bit of damage to the disk. We could have 4096
bits of damage in the worst case :-) and never know that fact.

If I wanted in desperation to recover whatever data I could, this would
be grand, but as it stands, from the Linux File System Driver
perspective, it would be dangerous to accept this block as anything more
than it is.

If the data is of a form that tolerates some loss, for example video or
audio content or an error-correcting stream of data, someone can make a
case that READ LONG is an appropriate action to take to help fill in
missing content.

A fun thought ...
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Disk Errors

2005-02-01 Thread Kit Gerrits
I have found 08:05 to correspond to /dev/sda5, mounted as /usr (thanks for
the pointer!).
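
The lookup described above can be sketched in a few lines of Python. This is only an illustration and rests on two assumptions that are not stated in the thread: that the kernel printed the device as hexadecimal major:minor, and that major 8 is the first SCSI disk major with 16 minors per disk. The function name scsi_disk_name is made up for this sketch.

```python
def scsi_disk_name(dev: str) -> str:
    """Map an "MM:mm" device string from an I/O error line to /dev/sdXN.

    Assumptions (not from the thread): both numbers are hexadecimal,
    and major 8 is the first SCSI disk major, 16 minors per disk.
    """
    major_text, minor_text = dev.split(":")
    major, minor = int(major_text, 16), int(minor_text, 16)
    if major != 8:
        raise ValueError(f"major {major} is not the first SCSI disk major (8)")
    disk = chr(ord("a") + minor // 16)   # minors 0-15 -> sda, 16-31 -> sdb, ...
    partition = minor % 16               # 0 means the whole disk
    return f"/dev/sd{disk}" + (str(partition) if partition else "")


print(scsi_disk_name("08:05"))
```

On a live system the same answer comes from comparing the major and minor numbers shown by ls -l /dev/sd* against the values in the error message.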

Sda is the single-drive volume
(non-RAID, as it is only for the O/S,
which needs to be speedy and can be pulled from tape easily).

This explains several things:
A/ Why a single error can take an entire volume offline
B/ Why the error is not logged
   If it only took the partition offline,
   it would still have been logged,
   as / is mounted from sda3

And leaves one question:
What caused the error?

There are no GROWN defects on the drive in this volume


---
Reference logs:
---

Executing: disk show defects (ID=0)
Number of PRIMARY defects on drive: 1912
Number of GROWN defects on drive: 0

Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    Volume 8.47GB            Open    0:00:0 64.0KB:8.47GB
 /dev/sda             NT
 1    RAID-5 16.9GB       32KB Open    0:01:0 64.0KB:8.47GB
 /dev/sdb             DATA             0:02:0 64.0KB:8.47GB
                                       ?:??:?  - Missing -

Mount points it to:
# /dev/sda5 5.3G  1.5G  3.6G  30% /usr
 

 -Original Message-
 From: Salyzyn, Mark [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, 1 February 2005 4:15
 To: Kit Gerrits
 Subject: RE: Disk errors
 
 The controller does not appear to be busted; you have a Volume and a 
 RAID-5. Are you missing an Array?
 
 A two drive failure on a RAID-5 gives you an offline array.
 
 A single drive failure in a Volume gives you an offline array.
 
 You need to find out who 08:05 is: look through /dev for the major/minor
 number and relate it to the 'device'. Look through /proc/scsi/scsi and
 /var/log/messages to help correlate it.
 
 Sincerely -- Mark Salyzyn
 



Re: Disk Errors

2005-02-01 Thread Douglas Gilbert
Kit,
A block/sector is added to the grown defect list after it
has been reassigned. Reassignment occurs automatically for
recoverable (medium) errors if the AWRE and/or ARRE bits are
set (those bits are in the read write error recovery mode page).

So there are two situations in which damaged blocks remain
accessible:
   1) unrecoverable medium errors
   2) recoverable medium errors when AWRE and/or ARRE
      are clear

Case 2) can be ignored ** or could be handled by setting
ARRE and then reading the whole disk (e.g. with dd). Both cases
can be handled with the REASSIGN BLOCKS SCSI command
once the defective logical block address (lba) or
addresses have been identified.

Using the sg3_utils package various things can be
done:
   - sginfo -e /dev/sda will show the AWRE and ARRE
     settings. Changing them with sginfo is a bit ugly
   - sginfo -G /dev/sda will show the grown defect list
     in index format (up to 3 other formats may be
     available)
   - sg_dd if=/dev/sg0 of=/dev/null bs=512 will read the
     whole disk or fail at the first unrecoverable (medium)
     error. If a medium error is detected the info
     field is the lba of the defect. ***
   - sg_reassign -a lba /dev/sda will reassign the
     lba block. If this succeeds lba should appear
     in the grown defect list (sginfo -G -Flogical /dev/sda).

When a logical block with unrecoverable errors is reassigned
then the new contents are vendor specific. I'm not sure how
file systems react to this.

** recoverable errors can be ignored. Assuming these
   recoverable errors occur on read operations then the
   read error counter log page's
   recovered error counter (one of them depending on the
   duration of the recovery process) will be incremented

*** due to error processing, it is still better to use /dev/sg0
    rather than /dev/sda with the sg_dd utility. Recent
    changes (lk 2.6.11-rc2-bk8) make the following work:
        sg_dd if=/dev/sda blk_sgio=1 of=/dev/null bs=512
    in the presence of errors

Doug Gilbert
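
To illustrate the remark that the info field carries the lba of the defect, here is a minimal Python sketch that pulls the sense key, info field, ASC and ASCQ out of fixed-format SCSI sense data. The function name and the sample buffer are invented for this example (the lba 0x12345 is hypothetical, not taken from Kit's system), and descriptor-format sense data is deliberately not handled.

```python
import struct

MEDIUM_ERROR = 0x3  # SCSI sense key value for a medium error


def parse_fixed_sense(sense: bytes) -> dict:
    """Decode a few fields of fixed-format (0x70/0x71) sense data."""
    if len(sense) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    if (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    info_valid = bool(sense[0] & 0x80)            # VALID bit
    (information,) = struct.unpack(">I", sense[3:7])
    return {
        "sense_key": sense[2] & 0x0F,
        "lba": information if info_valid else None,
        "asc": sense[12],
        "ascq": sense[13],
    }


# Hypothetical sample: medium error, unrecovered read error (ASC 0x11),
# info field set to a made-up lba of 0x12345.
sample = bytes([0xF0, 0x00, 0x03, 0x00, 0x01, 0x23, 0x45, 0x0A,
                0x00, 0x00, 0x00, 0x00, 0x11, 0x00, 0x00, 0x00, 0x00, 0x00])
parsed = parse_fixed_sense(sample)
if parsed["sense_key"] == MEDIUM_ERROR and parsed["lba"] is not None:
    print("medium error at lba", parsed["lba"])
```

The lba obtained this way is the value one would hand to sg_reassign -a, once it is certain that the device is a real disk and not a controller's pseudo device.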


RE: Disk Errors

2005-02-01 Thread Salyzyn, Mark
Good information for a single drive on a simple SCSI card. This will not
work for drives that are part of an array (volume) as /dev/sda
references a pseudo device. Besides, the firmware in the RAID controller
takes the actions necessary to perform recoverable bad block remaps.

Sincerely -- Mark Salyzyn


RE: Disk Errors

2005-02-01 Thread Cress, Andrew R
Kit,

If you have another (non-RAID) SCSI system, you could take the faulty
drive there to modify the mode pages to turn on AWRE and ARRE with
either sgmode (scsirastools.sf.net) or sginfo (sg3_utils).

Otherwise, you are dependent on the tools that are provided for the
PowerEdge RAID controller.

Andy


Re: Disk Errors

2005-02-01 Thread Bryan Henderson
So there are two situations in which damaged blocks remain
accessible:
1) unrecoverable medium errors
 ...

What's the rationale behind leaving a damaged block accessible in the case 
of an unrecoverable medium error?  A possibility that someone might 
actually be able to recover the data?



Re: Disk Errors

2005-02-01 Thread Douglas Gilbert
Salyzyn, Mark wrote:
 An unrecoverable medium error is typically `corrected' when a write to
 the block occurs. RAID cards will use the redundancy to calculate the
 data and write it back to the offending drive for instance.
 Otherwise, for non-redundant stores, bad media is as good as anything
 to remind one that the data is gone ;-)
 Sincerely -- Mark Salyzyn

All may not be lost. If a medium error occurs and the ASC and
ASCQ imply the sector could be read but
failed ECC then the READ LONG SCSI command should fetch the
block (plus ECC and other data). For example a Fujitsu MAM3184
returns 576 bytes. It is probably too much to expect that all
the damage will be in the last 64 bytes.

As Mark pointed out, if /dev/sda is a virtual disk then it is
unlikely that the READ LONG SCSI command will be supported.

sg3_utils has a sg_read_long utility. Long blocks can
be written to the media with the sg_write_long utility
which was introduced mainly for testing (e.g. creating
artificial medium errors).

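
As a rough illustration of the 576-byte figure, the Python sketch below splits a long block into the 512 data bytes and whatever follows. The layout is vendor specific; placing the user data first is an assumption made only for this example, and the function name split_long_block is invented.

```python
from typing import Tuple


def split_long_block(long_block: bytes, data_len: int = 512) -> Tuple[bytes, bytes]:
    """Split a READ LONG buffer into (user data, trailing ECC and other bytes).

    Assumes the user data comes first, which is vendor specific.
    """
    if len(long_block) < data_len:
        raise ValueError("buffer is shorter than one data block")
    return long_block[:data_len], long_block[data_len:]


# Placeholder buffer of the size reported for the Fujitsu MAM3184.
data, trailer = split_long_block(bytes(576))
print(len(data), len(trailer))
```

Nothing in the returned data says which bits are damaged, which is why such a block cannot be treated as good data.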
BTW I noticed that the block layer reads around a medium
error. Say 8 KB is being read and a medium error occurs
(and the info field is set to the lba of the first failure)
then several small reads are done to reconstruct as much
of the original 8 KB as possible (probably with a block of
zeroes corresponding to the medium error).
Doug Gilbert
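
The read-around behaviour described above can be imitated in user space. The Python sketch below is not the block layer's code, only an illustration of the idea: retry block by block and substitute zeroes where a read fails. The names read_around and fake_read and the failing lba 5 are invented for the example.

```python
import errno
from typing import Callable, List, Tuple


def read_around(read_block: Callable[[int], bytes], start: int, count: int,
                block_size: int = 512) -> Tuple[bytes, List[int]]:
    """Read `count` blocks one at a time, zero-filling any that fail."""
    out = bytearray()
    bad: List[int] = []
    for lba in range(start, start + count):
        try:
            chunk = read_block(lba)
        except OSError:
            chunk = bytes(block_size)   # stand-in for the unreadable block
            bad.append(lba)
        out += chunk
    return bytes(out), bad


def fake_read(lba: int) -> bytes:
    """Simulated device: every block reads fine except lba 5."""
    if lba == 5:
        raise OSError(errno.EIO, "simulated medium error")
    return bytes([lba % 256]) * 512


buf, bad_lbas = read_around(fake_read, 0, 16)   # 16 x 512 bytes = 8 KB
print(len(buf), bad_lbas)
```

Keeping the list of failed lbas alongside the data matters, because a zero-filled block is otherwise indistinguishable from a block that really contains zeroes.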


RE: Disk errors

2005-01-31 Thread Kit Gerrits
Andrew,

Thanks for explaining the initial vs grown error list.
 
Unfortunately, the tool itself monitors software RAID and SCSI devices.
This means that sgmode itself sees only the containers on the PERC.


Would you happen to know how to accomplish this in afacli?


AFA0> disk set ?
disk set default - Sets the various disk defaults for all subsequent CLI
commands.
disk set smart - Change a device's SMART configuration.

AFA0> disk show ?
disk show default - Shows the various defaults set for the CLI commands.
disk show defects - Shows the number of defects and/or defect list on a
particular disk drive.
disk show partition - Shows the partitions on the disks attached to this
controller.
disk show smart - Displays SMART values and settings for SMART enabled
devices.
disk show space - Shows space usage on the disks attached to the controller.

AFA0> disk show default
Executing: disk show default
No Default

AFA0> disk show smart
Executing: disk show smart
        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  -----
0:00:0     Y            6              Y           N          0
0:01:0     Y            6              Y           N          0
0:02:0     Y            6              Y           N          0
0:03:0     Y            6              Y           N          0
0:06:0     N


Thanks for the info

Kit


 -Original Message-
 From: Cress, Andrew R [mailto:[EMAIL PROTECTED]
 Sent: Monday, 31 January 2005 15:46
 To: Kit Gerrits; linux-scsi@vger.kernel.org
 Subject: RE: Disk errors
 
 Kit,
 
 With the growing size of disk drives, and more sectors
 allocated as reserve sectors, the number of defects alone is
 not a big concern, especially if they are PRIMARY defects
 (found at manufacture time).
 What would be of concern is an increase in the number of
 GROWN defects over a short period of time.  Unfortunately, it
 is quite common for one defect to cause a disk to be
 replaced, when it could be remapped without the expense and
 trouble of a field replacement.
 
 The automatic remapping of grown defects is a feature of SCSI
 disks, but may not be configured in the disk's mode pages.
 The mode pages can be changed without affecting the content
 of the disk (with the exception of size & sector mapping
 parameters).  There are several Linux tools to read/set mode
 pages, among which is 'sgmode' from http://scsirastools.sf.net.
 
 As a guess, it appears that you had a grown defect occur on
 one of your disks, but the remapping was not set to occur
 automatically on that disk, so a write never finished.
 
 Andy
 
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Kit Gerrits
 Sent: Monday, January 31, 2005 9:28 AM
 To: linux-scsi@vger.kernel.org
 Subject: Disk errors
 
 
 Exactly how many errors is a SCSI disk allowed to have?
 
 I have a PE2400 with a PERC2/Si with 4x9GB
 
 My disks show:
 AFA0> disk show defects 0
 Executing: disk show defects (ID=0)
 Number of PRIMARY defects on drive: 1912
 Number of GROWN defects on drive: 0
 
 AFA0> disk show defects 1
 Executing: disk show defects (ID=1)
 Number of PRIMARY defects on drive: 952
 Number of GROWN defects on drive: 1
 
 AFA0> disk show defects 2
 Executing: disk show defects (ID=2)
 Number of PRIMARY defects on drive: 2457
 Number of GROWN defects on drive: 0
 
 AFA0> disk show defects 3
 Executing: disk show defects (ID=3)
 Number of PRIMARY defects on drive: 2794
 Number of GROWN defects on drive: 0
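
The defect counts above are easy to tabulate with a script if they are to be watched over time. The Python sketch below parses the two "Number of ... defects" figures from each report and lists the drives that have any grown defects. The parsing is based only on the output quoted in this thread; the helper name parse_defects is made up, and other afacli versions may print something different.

```python
import re

DEFECT_RE = re.compile(r"Number of (PRIMARY|GROWN) defects on drive:\s*(\d+)")


def parse_defects(text: str) -> dict:
    """Return {'primary': n, 'grown': m} from 'disk show defects' output."""
    return {kind.lower(): int(count) for kind, count in DEFECT_RE.findall(text)}


# The four reports quoted above, keyed by SCSI ID.
reports = {
    0: "Number of PRIMARY defects on drive: 1912\nNumber of GROWN defects on drive: 0",
    1: "Number of PRIMARY defects on drive: 952\nNumber of GROWN defects on drive: 1",
    2: "Number of PRIMARY defects on drive: 2457\nNumber of GROWN defects on drive: 0",
    3: "Number of PRIMARY defects on drive: 2794\nNumber of GROWN defects on drive: 0",
}

with_grown = [disk for disk, text in reports.items()
              if parse_defects(text)["grown"] > 0]
print(with_grown)
```

Saving the parsed counts with a date and comparing them against the next run would show whether the grown list is increasing.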
 
 The reason I ask is that my O/S (RedHat Enterprise Linux 3.0)
 has recently hung with the error:
 
 I/O Error Dev 08:05 Sector 529712
 
 I would assume that this error is generated by the harddrive, 
 but shouldn't the controller catch SCSI errors (and relocate 
 sectors automagically)?
 
 Thanks in advance,
 
 Kit Gerrits
 
 [EMAIL PROTECTED]
 


RE: Disk errors

2005-01-31 Thread Kit Gerrits
Indeed, I had an entire screenful of errors (a few each second) when I came
in in the morning...
The strange thing is that the drive with the grown error is part of the
DATA container (/home and /data), whilst the disk with the rest ( / ) was
fine.

You'd expect the error to show up in /var/log/messages, but it didn't.
I think the entire controller gave up as soon as the error popped up.

-
Is there a way of having the controller detect / handle grown errors?
Will setting automatic remapping handle this?

Does anyone know how to read / write mode pages?


Thanks all!

Kit

 -Original Message-
 From: Salyzyn, Mark [mailto:[EMAIL PROTECTED]
 Sent: Monday, 31 January 2005 17:03
 To: Kit Gerrits
 Subject: RE: Disk errors
 
 You get tons of I/O error messages from the filesystem
 driver once the device goes offline. You can check
 /var/log/messages to find the root cause.
 
 You will need to run the RAID management tools (afacli) to 
 display the underlying components (container list). Dell has 
 their own customized tools for this, I can not comment on their usage.
 
 Sincerely -- Mark Salyzyn
 



RE: Disk errors

2005-01-31 Thread Cress, Andrew R

I don't know much about afacli.

The mode pages do have bits to enable SMART, but that's not what I think
you are interested in.
However, SMART can generate info events that the OS may not be
recognizing.  

What you are interested in is mode page 0x01, to see whether AWRE and
ARRE are turned on (bits 7 & 6, 0xC0).
The default setting for these may also be documented in the disk manual
for your drives, which can be obtained from the vendor web site.  Or,
the PERC vendor may be able to help get this info.
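
For reference, the bit test mentioned above looks like this in Python. It is a minimal sketch that assumes the caller already holds the raw bytes of mode page 0x01 starting at the page code byte, with the mode parameter header and any block descriptor stripped off. The function name and the sample bytes are invented for illustration.

```python
AWRE = 0x80  # bit 7: automatic write reallocation enabled
ARRE = 0x40  # bit 6: automatic read reallocation enabled


def recovery_bits(page: bytes) -> dict:
    """Report AWRE/ARRE from a read-write error recovery mode page (0x01)."""
    if len(page) < 3:
        raise ValueError("mode page too short")
    if (page[0] & 0x3F) != 0x01:
        raise ValueError("not mode page 0x01")
    flags = page[2]
    return {"AWRE": bool(flags & AWRE), "ARRE": bool(flags & ARRE)}


# Hypothetical page contents: page code 0x01 (PS bit set), length 0x0A,
# flags byte 0xC0, remaining bytes zeroed.
sample_page = bytes([0x81, 0x0A, 0xC0] + [0x00] * 9)
print(recovery_bits(sample_page))
```

A flags byte of 0xC0 means both bits are on; 0x00 means the drive will not reassign automatically on either reads or writes.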

Andy


RE: Disk errors

2005-01-31 Thread Salyzyn, Mark
The PERC controller looks after bad block reassignment.

Sincerely -- Mark Salyzyn



RE: Disk errors

2005-01-31 Thread Kit Gerrits
But if the PERC (controller) handles disk errors, what could cause:

I/O Error Dev 08:05 Sector 529712

I would assume that this error is generated by the harddrive, but shouldn't
the controller catch SCSI errors (and relocate sectors automagically)?

Kit

SCSI relevant DMESG:
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
Adaptec aic7880 Ultra SCSI adapter
aic7880: Ultra Single Channel A, SCSI Id=7, 16/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
Adaptec 2940 Ultra2 SCSI adapter
aic7890/91: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs

blk: queue d7ab8814, I/O limit 4095Mb (mask 0x)
(scsi0:A:5): 20.000MB/s transfers (20.000MHz, offset 15)
  Vendor: NEC   Model: CD-ROM DRIVE:466  Rev: 1.06
  Type:   CD-ROM ANSI SCSI revision: 02
blk: queue c1fc1e14, I/O limit 4095Mb (mask 0x)
(scsi1:A:6): 20.000MB/s transfers (10.000MHz, offset 15, 16bit)
  Vendor: QUANTUM   Model: DLT7000   Rev: 2561
  Type:   Sequential-Access  ANSI SCSI revision: 02
blk: queue c1fc1a14, I/O limit 4095Mb (mask 0x)
Red Hat/Adaptec aacraid driver (1.1.2 Jun 29 2004 18:26:27)
PCI: Found IRQ 14 for device 00:02.1
AAC0: kernel 2.1.4 build 2939
AAC0: monitor 2.1.4 build 2939
AAC0: bios 2.1.0 build 2939
AAC0: serial 410010d0fafaf001
spurious 8259A interrupt: IRQ7.
scsi2 : percraid
  Vendor: DELL  Model: PERCRAID Volume   Rev: V1.0
  Type:   Direct-Access  ANSI SCSI revision: 02
blk: queue c1fc1c14, I/O limit 4095Mb (mask 0x)
  Vendor: DELL  Model: PERCRAID RAID5  Rev: V1.0
  Type:   Direct-Access  ANSI SCSI revision: 02
blk: queue d7ab9e14, I/O limit 4095Mb (mask 0x)
Attached scsi removable disk sda at scsi2, channel 0, id 0, lun 0
Attached scsi removable disk sdb at scsi2, channel 0, id 1, lun 0
SCSI device sda: 17771136 512-byte hdwr sectors (9099 MB)
sda: Write Protect is off
Partition check:
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
SCSI device sdb: 35542272 512-byte hdwr sectors (18198 MB)
sdb: Write Protect is off
 sdb: sdb1 sdb2 

 -Original Message-
 From: Salyzyn, Mark [mailto:[EMAIL PROTECTED] 
 Sent: Monday, January 31, 2005 19:22
 To: Kit Gerrits
 CC: linux-scsi@vger.kernel.org
 Subject: RE: Disk errors
 
 The PERC controller looks after bad block reassignment.
 
 Sincerely -- Mark Salyzyn
 
 -Original Message-
 From: Kit Gerrits [mailto:[EMAIL PROTECTED]
 Sent: Monday, January 31, 2005 11:44 AM
 To: Salyzyn, Mark
 Cc: linux-scsi@vger.kernel.org
 Subject: RE: Disk errors
 
 Indeed, I had an entire screenful of errors (a few each 
 second) when I came in in the morning...
 The strange thing is that the drive with the grown error is 
 part of the DATA container (/home and /data), whilst the disk 
 with the rest ( / ) was fine.
 
 You'd expect the error to show up in /var/log/messages, but 
 it didn't. 
 I think the entire controller gave up as soon as the error popped up.
 
 -
 Is there a way of having the controller detect / handle grown errors?
 Will setting automatic remapping handle this?
 
 Does anyone know how to read / write mode pages?
 
 
 Thanks all!
 
 Kit
 
  -Original Message-
  From: Salyzyn, Mark [mailto:[EMAIL PROTECTED]
  Sent: Monday, January 31, 2005 17:03
  To: Kit Gerrits
  Subject: RE: Disk errors
  
  You get tons of I/O error messages from the filesystem driver once
  the device goes offline. You can check /var/log/messages to find the
  root cause.
  
  You will need to run the RAID management tools (afacli) to display the
  underlying components (container list). Dell has their own customized
  tools for this; I cannot comment on their usage.
  
  Sincerely -- Mark Salyzyn
  
 



Re: Disk errors

2005-01-31 Thread Matt Domsch
On Tue, Feb 01, 2005 at 12:41:13AM +0100, Kit Gerrits wrote:
 But if the PERC (controller) handles disk errors, what could cause:
 
 I/O Error Dev 08:05 Sector 529712
 
 I would assume that this error is generated by the hard drive, but shouldn't
 the controller catch SCSI errors (and relocate sectors automagically)?

In this case, the RAID controller is reporting the I/O error.  It may
be that you've got bad sectors on more than one physical disk, in the
same stripe, and the RAID controller can't fix them.
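
To illustrate why two bad sectors in the same stripe are fatal, here is a
minimal sketch of single-parity (RAID 5 style) reconstruction. It assumes
plain XOR parity over equal-sized blocks; the function names and the use of
None to mark an unreadable block are made up for illustration and say
nothing about how the PERC firmware is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_stripe(stripe):
    """Return the stripe with at most one unreadable block rebuilt.

    `stripe` is a list of data blocks plus one parity block; an unreadable
    block is represented by None.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if not missing:
        return list(stripe)
    if len(missing) > 1:
        # One parity block can only solve for one unknown.
        raise IOError("I/O error: %d unreadable blocks in stripe" % len(missing))
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    stripe = data + [xor_blocks(data)]
    damaged = [stripe[0], None, stripe[2], stripe[3]]
    print(read_stripe(damaged) == stripe)
```

With one block missing the XOR of the survivors restores it; with two
missing there is nothing left to solve with, and the only honest answer to
the host is an I/O error.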

Thanks,
Matt

-- 
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com  www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


RE: Disk errors

2005-01-31 Thread Guy
Maybe you have a failed disk, and another has bad blocks.  So, no good copy
of the data exists.  Just a guess!!!

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Kit Gerrits
Sent: Monday, January 31, 2005 6:41 PM
To: 'Salyzyn, Mark'
Cc: linux-scsi@vger.kernel.org
Subject: RE: Disk errors

But if the PERC (controller) handles disk errors, what could cause:

I/O Error Dev 08:05 Sector 529712

I would assume that this error is generated by the hard drive, but shouldn't
the controller catch SCSI errors (and relocate sectors automagically)?

Kit

