RE: Disk Errors
Douglas Gilbert writes:

All may not be lost. If a medium error occurs and the ASC and ASCQ imply the sector could be read but failed ECC, then the READ LONG SCSI command should fetch the block (plus ECC and other data). For example, a Fujitsu MAM3184 returns 576 bytes. It is probably too much to expect that all the damage will be in the last 64 bytes.

However, the drive has taken whatever action it could to reconstruct the data; the failure to report the block for a standard read means that the data is in fact 'lost'. The data+ECC combination must be in a state where there are more bits of damage than the error correction can deal with; 64 bytes of ECC deals with single bit errors, thus we know that we have more than 1 bit of damage to the disk. We could have 4096 bits of damage in the worst case :-) and never know that fact.

If I wanted in desperation to recover whatever data I could, this would be grand, but as it stands, from the Linux File System Driver perspective, it would be dangerous to accept this block as anything more than it is.

If the data is of a form that permits some loss, for example video, audio content or an error correcting stream of data, someone can make a case where READ_LONG is an appropriate action to take to help fill in missing content. A fun thought ...

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
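The figures in this message can be checked with a little arithmetic. The sketch below assumes 512-byte user data sectors (which is what the 576-byte READ LONG result and the 64 bytes of ECC suggest); the constant names are illustrative only.

```python
# Arithmetic behind the figures quoted for the Fujitsu MAM3184.
# Assumption: 512 bytes of user data per sector; names are illustrative.
READ_LONG_BYTES = 576   # bytes returned by READ LONG, per the message
DATA_BYTES = 512        # assumed user data per sector

extra_bytes = READ_LONG_BYTES - DATA_BYTES   # ECC and other data
worst_case_bits = DATA_BYTES * 8             # every data bit damaged

print(extra_bytes)       # 64
print(worst_case_bits)   # 4096
```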
Re: Disk Errors
I have found 08:05 to correspond to /dev/sda5, mounted as /usr (thanks for the pointer!).

sda is the single-drive volume (non-RAID, as it is only for the O/S, which needs to be speedy and can be pulled from tape easily).

This explains several things:
A/ Why a single error can take an entire volume offline
B/ Why the error is not logged
   (If it had only taken the partition offline, it would still have been logged, as / is mounted from sda3.)

And it leaves one question: what caused the error? There are no GROWN defects on the drive in this volume.

--- Reference logs: ---

Executing: disk show defects (ID=0)
Number of PRIMARY defects on drive: 1912
Number of GROWN defects on drive: 0

Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    Volume 8.47GB            Open    0:00:0 64.0KB:8.47GB
 /dev/sda             NT
 1    RAID-5 16.9GB       32KB Open    0:01:0 64.0KB:8.47GB
 /dev/sdb             DATA             0:02:0 64.0KB:8.47GB
                                       ?:??:? - Missing -

Mount points it to:
# /dev/sda5 5.3G 1.5G 3.6G 30% /usr

-----Original Message-----
From: Salyzyn, Mark [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 1 February 2005 4:15
To: Kit Gerrits
Subject: RE: Disk errors

The controller does not appear to be busted; you have a Volume and a RAID-5. Are you missing an Array?

A two drive failure on a RAID-5 gives you an offline array. A single drive failure in a Volume gives you an offline array.

You need to find who is 08:05: look through /dev for the major/minor number and relate it to the 'device'. Look through /proc/scsi/scsi and /var/log/messages to help correlate it.

Sincerely -- Mark Salyzyn
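The mapping from 08:05 to /dev/sda5 follows the usual Linux numbering for SCSI disks: major 8, with 16 minor numbers per disk. A minimal sketch of that lookup is below; it assumes the kernel message prints major and minor in hexadecimal, and the function name decode_scsi_dev is made up for illustration.

```python
def decode_scsi_dev(dev: str) -> str:
    """Map a 'major:minor' pair (assumed hex, e.g. '08:05') to a /dev/sdXN name."""
    major_s, minor_s = dev.split(":")
    major, minor = int(major_s, 16), int(minor_s, 16)
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk = chr(ord("a") + minor // 16)   # 16 minor numbers per disk
    part = minor % 16                    # 0 means the whole disk
    return f"/dev/sd{disk}" + (str(part) if part else "")

print(decode_scsi_dev("08:05"))  # /dev/sda5
```

The sketch only covers the first block of SCSI disks (sda to sdp); it is a cross-check for the result found by looking through /dev, not a replacement for it.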
Re: Disk Errors
Kit Gerrits wrote:

I have found 08:05 to correspond to /dev/sda5, mounted as /usr (thanks for the pointer!). sda is the single-drive volume (non-RAID, as it is only for the O/S, which needs to be speedy and can be pulled from tape easily). This explains several things: A/ Why a single error can take an entire volume offline B/ Why the error is not logged (if it had only taken the partition offline, it would still have been logged, as / is mounted from sda3). And it leaves one question: what caused the error? There are no GROWN defects on the drive in this volume.

Kit,

A block/sector is added to the grown defect list after it has been reassigned. Reassignment occurs automatically for recoverable (medium) errors if the AWRE and/or ARRE bits are set (those bits are in the read write error recovery mode page).

So there are two situations in which damaged blocks remain accessible:
1) unrecoverable medium errors
2) recoverable medium errors when AWRE and/or ARRE are clear

Case 2) can be ignored ** or could be handled by setting ARRE and then reading the whole disk (e.g. with dd). Both cases can be handled with the REASSIGN BLOCKS SCSI command once the defective logical block address (LBA) or addresses have been identified.

Using the sg3_utils package various things can be done:
 - sginfo -e /dev/sda
   will show the AWRE and ARRE settings. Changing them with sginfo is a bit ugly.
 - sginfo -G /dev/sda
   will show the grown defect list in index format (up to 3 other formats may be available).
 - sg_dd if=/dev/sg0 of=/dev/null bs=512
   will read the whole disk or fail at the first unrecoverable (medium) error. If a medium error is detected, the info field is the LBA of the defect. ***
 - sg_reassign -a lba /dev/sda
   will reassign the block at lba. If this succeeds, lba should appear in the grown defect list (sginfo -G -Flogical /dev/sda).

When a logical block with unrecoverable errors is reassigned, the new contents are vendor specific. I'm not sure how file systems react to this.

** Recoverable errors can be ignored. Assuming these recoverable errors occur on read operations, the read error counter log page's recovered error counter (one of them, depending on the duration of the recovery process) will be incremented.

*** Due to error processing, it is still better to use /dev/sg0 rather than /dev/sda with the sg_dd utility. Recent changes (lk 2.6.11-rc2-bk8) make the following work in the presence of errors:
   sg_dd if=/dev/sda blk_sgio=1 of=/dev/null bs=512

Doug Gilbert
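The sg3_utils steps from this message can be laid out as an ordered checklist. The sketch below only builds and prints the command lines quoted in the message; it does not run anything. The function name, the default device paths and the example LBA 1234 are hypothetical and must be replaced with real values.

```python
def defect_workflow(disk="/dev/sda", sg="/dev/sg0", lba=None):
    """Return the sg3_utils command lines from the message, in order."""
    steps = [
        ["sginfo", "-e", disk],                            # show AWRE/ARRE settings
        ["sginfo", "-G", disk],                            # show grown defect list
        ["sg_dd", f"if={sg}", "of=/dev/null", "bs=512"],   # read until first medium error
    ]
    if lba is not None:
        # Reassign the failing block, then confirm it is in the grown list.
        steps.append(["sg_reassign", "-a", str(lba), disk])
        steps.append(["sginfo", "-G", "-Flogical", disk])
    return steps

# 1234 is a placeholder LBA, not a value from the thread.
for argv in defect_workflow(lba=1234):
    print(" ".join(argv))
```

Keeping the commands as argument lists makes it easy to hand them to subprocess.run later, once the device names have been checked against the system in question.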
RE: Disk Errors
Good information for a single drive on a simple SCSI card. This will not work for drives that are part of an array (volume) as /dev/sda references a pseudo device. Besides, the firmware in the RAID controller takes the actions necessary to perform recoverable bad block remaps.

Sincerely -- Mark Salyzyn

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Douglas Gilbert
Sent: Tuesday, February 01, 2005 7:44 AM
To: Kit Gerrits
Cc: linux-scsi@vger.kernel.org
Subject: Re: Disk Errors
RE: Disk Errors
Kit,

If you have another (non-RAID) SCSI system, you could take the faulty drive there to modify the mode pages to turn on AWRE and ARRE with either sgmode (scsirastools.sf.net) or sginfo (sg3_utils). Otherwise, you are dependent on the tools that are provided for the PowerEdge RAID controller.

Andy
Re: Disk Errors
So there are two situations in which damaged blocks remain accessible:
1) unrecoverable medium errors
...

What's the rationale behind leaving a damaged block accessible in the case of an unrecoverable medium error? A possibility that someone might actually be able to recover the data?
Re: Disk Errors
Salyzyn, Mark wrote:

An unrecoverable medium error is typically 'corrected' when a write to the block occurs. RAID cards will use the redundancy to calculate the data and write it back to the offending drive, for instance. Otherwise, for non-redundant stores, bad media is as good as anything to remind one that the data is gone ;-)

Sincerely -- Mark Salyzyn

All may not be lost. If a medium error occurs and the ASC and ASCQ imply the sector could be read but failed ECC, then the READ LONG SCSI command should fetch the block (plus ECC and other data). For example, a Fujitsu MAM3184 returns 576 bytes. It is probably too much to expect that all the damage will be in the last 64 bytes.

As Mark pointed out, if /dev/sda is a virtual disk then it is unlikely that the READ LONG SCSI command will be supported.

sg3_utils has a sg_read_long utility. Long blocks can be written to the media with the sg_write_long utility, which was introduced mainly for testing (e.g. creating artificial medium errors).

BTW I noticed that the block layer reads around a medium error. Say 8 KB is being read and a medium error occurs (and the info field is set to the LBA of the first failure); then several small reads are done to reconstruct as much of the original 8 KB as possible (probably with a block of zeroes corresponding to the medium error).

Doug Gilbert

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Bryan Henderson
Sent: Tuesday, February 01, 2005 1:01 PM
To: [EMAIL PROTECTED]
Cc: Kit Gerrits; linux-scsi@vger.kernel.org
Subject: Re: Disk Errors

So there are two situations in which damaged blocks remain accessible:
1) unrecoverable medium errors
...

What's the rationale behind leaving a damaged block accessible in the case of an unrecoverable medium error? A possibility that someone might actually be able to recover the data?
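The "read around a medium error" behaviour described in this message can be illustrated with a toy model. This is not the kernel's block layer code; it is a sketch that assumes 512-byte sectors, retries sector by sector after a failed large read, and zero-fills whatever cannot be read. The names read_around and fake_read are made up for the example.

```python
from typing import Callable, List, Tuple

SECTOR = 512  # assumed sector size

def read_around(read_sectors: Callable[[int, int], bytes],
                lba: int, count: int) -> Tuple[bytes, List[int]]:
    """Try one large read; on failure retry per sector, zero-filling bad ones."""
    try:
        return read_sectors(lba, count), []
    except IOError:
        pass
    out, bad = bytearray(), []
    for cur in range(lba, lba + count):
        try:
            out += read_sectors(cur, 1)
        except IOError:
            out += bytes(SECTOR)   # placeholder zeroes for the lost sector
            bad.append(cur)
    return bytes(out), bad

# Toy device in which sector 5 is unreadable.
def fake_read(lba: int, count: int) -> bytes:
    if lba <= 5 < lba + count:
        raise IOError("medium error")
    return b"".join(bytes([n % 256]) * SECTOR for n in range(lba, lba + count))

data, bad = read_around(fake_read, 0, 16)   # 16 sectors = 8 KB
print(len(data), bad)  # 8192 [5]
```

The caller gets back both the salvaged buffer and the list of sectors that were zero-filled, so it can tell real zeroes from lost data.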
RE: Disk errors
Andrew,

Thanks for explaining the initial vs grown error list.

Unfortunately, the tool itself monitors software RAID and SCSI devices. This means that sgmode itself sees only the containers on the PERC. Would you happen to know how to accomplish this in afacli?

AFA0 disk set ?
disk set default    - Sets the various disk defaults for all subsequent CLI commands.
disk set smart      - Change a device's SMART configuration.

AFA0 disk show ?
disk show default   - Shows the various defaults set for the CLI commands.
disk show defects   - Shows the number of defects and/or defect list on a particular disk drive.
disk show partition - Shows the partitions on the disks attached to this controller.
disk show smart     - Displays SMART values and settings for SMART enabled devices.
disk show space     - Shows space usage on the disks attached to the controller.

AFA0 disk show default
Executing: disk show default
No Default

AFA0 disk show smart
Executing: disk show smart
        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
B:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  -----
0:00:0  Y        6                 Y          N            0
0:01:0  Y        6                 Y          N            0
0:02:0  Y        6                 Y          N            0
0:03:0  Y        6                 Y          N            0
0:06:0  N

Thanks for the info
Kit

-----Original Message-----
From: Cress, Andrew R [mailto:[EMAIL PROTECTED]]
Sent: Monday, 31 January 2005 15:46
To: Kit Gerrits; linux-scsi@vger.kernel.org
Subject: RE: Disk errors

Kit,

With the growing size of disk drives, and more sectors allocated as reserve sectors, the number of defects alone is not a big concern, especially if they are PRIMARY defects (found at manufacture time). What would be of concern is an increase in the number of GROWN defects over a short period of time.

Unfortunately, it is quite common for one defect to cause a disk to be replaced, when it could be remapped without the expense and trouble of a field replacement. The automatic remapping of grown defects is a feature of SCSI disks, but may not be configured in the disk's mode pages. The mode pages can be changed without affecting the content of the disk (with the exception of size/sector mapping parameters). There are several Linux tools to read/set mode pages, among which is 'sgmode' from http://scsirastools.sf.net.

As a guess, it appears that you had a grown defect occur on one of your disks, but the remapping was not set to occur automatically on that disk, so a write never finished.

Andy

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Kit Gerrits
Sent: Monday, January 31, 2005 9:28 AM
To: linux-scsi@vger.kernel.org
Subject: Disk errors

Exactly how many errors is a SCSI disk allowed to have?

I have a PE2400 with a PERC2/Si with 4x9GB. My disks show:

AFA0 disk show defects 0
Executing: disk show defects (ID=0)
Number of PRIMARY defects on drive: 1912
Number of GROWN defects on drive: 0

AFA0 disk show defects 1
Executing: disk show defects (ID=1)
Number of PRIMARY defects on drive: 952
Number of GROWN defects on drive: 1

AFA0 disk show defects 2
Executing: disk show defects (ID=2)
Number of PRIMARY defects on drive: 2457
Number of GROWN defects on drive: 0

AFA0 disk show defects 3
Executing: disk show defects (ID=3)
Number of PRIMARY defects on drive: 2794
Number of GROWN defects on drive: 0

The reason I ask is that my O/S (RedHat Enterprise Linux 3.0) has recently hung with the error:
I/O Error Dev 08:05 Sector 529712

I would assume that this error is generated by the hard drive, but shouldn't the controller catch SCSI errors (and relocate sectors automagically)?

Thanks in advance,
Kit Gerrits
[EMAIL PROTECTED]
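Since the advice in this exchange is to watch for growth in the GROWN defect count rather than the absolute numbers, the "disk show defects" output can be parsed and compared between runs. The sketch below works on text in the shape shown in the thread; the function names and the idea of storing an earlier snapshot are assumptions for illustration, not part of afacli.

```python
import re

SAMPLE = """\
Executing: disk show defects (ID=0)
Number of PRIMARY defects on drive: 1912
Number of GROWN defects on drive: 0
Executing: disk show defects (ID=1)
Number of PRIMARY defects on drive: 952
Number of GROWN defects on drive: 1
"""

def parse_defects(text):
    """Return {disk_id: {'PRIMARY': n, 'GROWN': n}} from 'disk show defects' text."""
    result, current = {}, None
    for line in text.splitlines():
        m = re.search(r"\(ID=(\d+)\)", line)
        if m:
            current = int(m.group(1))
            result[current] = {}
            continue
        m = re.search(r"Number of (PRIMARY|GROWN) defects on drive:\s*(\d+)", line)
        if m and current is not None:
            result[current][m.group(1)] = int(m.group(2))
    return result

def grown_increase(before, after):
    """Return {disk_id: delta} for disks whose GROWN count went up."""
    deltas = {}
    for disk, counts in after.items():
        old = before.get(disk, {}).get("GROWN", 0)
        if counts.get("GROWN", 0) > old:
            deltas[disk] = counts["GROWN"] - old
    return deltas

now = parse_defects(SAMPLE)
earlier = {0: {"GROWN": 0}, 1: {"GROWN": 0}}   # hypothetical earlier snapshot
print(grown_increase(earlier, now))  # {1: 1}
```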
RE: Disk errors
Indeed, I had an entire screenful of errors (a few each second) when I came in in the morning...

The strange thing is that the drive with the grown error is part of the DATA container (/home and /data), whilst the disk with the rest ( / ) was fine. You'd expect the error to show up in /var/log/messages, but it didn't. I think the entire controller gave up as soon as the error popped up.

- Is there a way of having the controller detect / handle grown errors?
- Will setting automatic remapping handle this?
- Does anyone know how to read / write mode pages?

Thanks all!
Kit

-----Original Message-----
From: Salyzyn, Mark [mailto:[EMAIL PROTECTED]]
Sent: Monday, 31 January 2005 17:03
To: Kit Gerrits
Subject: RE: Disk errors

You get tons of I/O error messages from the filesystem driver once the device goes offline. You can check /var/log/messages to find the root cause. You will need to run the RAID management tools (afacli) to display the underlying components (container list). Dell has their own customized tools for this; I cannot comment on their usage.

Sincerely -- Mark Salyzyn
RE: Disk errors
I don't know much about afacli. The mode pages do have bits to enable SMART, but that's not what I think you are interested in. However, SMART can generate info events that the OS may not be recognizing.

What you are interested in is mode page 0x01, to see if AWRE and ARRE are turned on (bits 7 and 6, mask 0xC0). The default setting for these may also be documented in the disk manual for your drives, which can be obtained from the vendor web site. Or, the PERC vendor may be able to help get this info.

Andy

-----Original Message-----
From: Kit Gerrits [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 31, 2005 10:22 AM
To: Cress, Andrew R; linux-scsi@vger.kernel.org
Subject: RE: Disk errors
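The AWRE/ARRE check on mode page 0x01 comes down to testing two bits. A minimal sketch follows, assuming the page bytes have already been fetched by some tool and that any mode parameter header and block descriptors have been stripped, so that byte 0 is the page code and byte 2 carries the flags. The function name and the example bytes are hypothetical.

```python
def recovery_bits(page: bytes) -> dict:
    """Decode AWRE/ARRE from a Read-Write Error Recovery mode page (code 0x01)."""
    if len(page) < 3 or (page[0] & 0x3F) != 0x01:
        raise ValueError("not mode page 0x01")
    flags = page[2]
    return {
        "AWRE": bool(flags & 0x80),   # bit 7: automatic write reallocation
        "ARRE": bool(flags & 0x40),   # bit 6: automatic read reallocation
    }

# Hypothetical page contents with both bits set (flags byte = 0xC0).
example = bytes([0x81, 0x0A, 0xC0]) + bytes(9)
print(recovery_bits(example))  # {'AWRE': True, 'ARRE': True}
```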
RE: Disk errors
The PERC controller looks after bad block reassignment.

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Kit Gerrits [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 31, 2005 11:44 AM
To: Salyzyn, Mark
Cc: linux-scsi@vger.kernel.org
Subject: RE: Disk errors
RE: Disk errors
But if the PERC (controller) handles disk errors, what could cause: I/O Error Dev 08:05 Sector 529712 I would assume that this error is generated by the harddrive, but shouldn't the controller catch SCSI errors (and relocate sectors automagically)? Kit SCSI relevant DMESG: scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36 Adaptec aic7880 Ultra SCSI adapter aic7880: Ultra Single Channel A, SCSI Id=7, 16/253 SCBs scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36 Adaptec 2940 Ultra2 SCSI adapter aic7890/91: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs blk: queue d7ab8814, I/O limit 4095Mb (mask 0x) (scsi0:A:5): 20.000MB/s transfers (20.000MHz, offset 15) Vendor: NEC Model: CD-ROM DRIVE:466 Rev: 1.06 Type: CD-ROM ANSI SCSI revision: 02 blk: queue c1fc1e14, I/O limit 4095Mb (mask 0x) (scsi1:A:6): 20.000MB/s transfers (10.000MHz, offset 15, 16bit) Vendor: QUANTUM Model: DLT7000 Rev: 2561 Type: Sequential-Access ANSI SCSI revision: 02 blk: queue c1fc1a14, I/O limit 4095Mb (mask 0x) Red Hat/Adaptec aacraid driver (1.1.2 Jun 29 2004 18:26:27) PCI: Found IRQ 14 for device 00:02.1 AAC0: kernel 2.1.4 build 2939 AAC0: monitor 2.1.4 build 2939 AAC0: bios 2.1.0 build 2939 AAC0: serial 410010d0fafaf001 spurious 8259A interrupt: IRQ7. 
scsi2 : percraid Vendor: DELL Model: PERCRAID Volume Rev: V1.0 Type: Direct-Access ANSI SCSI revision: 02 blk: queue c1fc1c14, I/O limit 4095Mb (mask 0x) Vendor: DELL Model: PERCRAID RAID5Rev: V1.0 Type: Direct-Access ANSI SCSI revision: 02 blk: queue d7ab9e14, I/O limit 4095Mb (mask 0x) Attached scsi removable disk sda at scsi2, channel 0, id 0, lun 0 Attached scsi removable disk sdb at scsi2, channel 0, id 1, lun 0 SCSI device sda: 17771136 512-byte hdwr sectors (9099 MB) sda: Write Protect is off Partition check: sda: sda1 sda2 sda3 sda4 sda5 sda6 SCSI device sdb: 35542272 512-byte hdwr sectors (18198 MB) sdb: Write Protect is off sdb: sdb1 sdb2 -Oorspronkelijk bericht- Van: Salyzyn, Mark [mailto:[EMAIL PROTECTED] Verzonden: maandag 31 januari 2005 19:22 Aan: Kit Gerrits CC: linux-scsi@vger.kernel.org Onderwerp: RE: Disk errors The PERC controller looks after bad block reassignment. Sincerely -- Mark Salyzyn -Original Message- From: Kit Gerrits [mailto:[EMAIL PROTECTED] Sent: Monday, January 31, 2005 11:44 AM To: Salyzyn, Mark Cc: linux-scsi@vger.kernel.org Subject: RE: Disk errors Indeed, I had an entire screenful of errors (a few each second) when I came in in the morning... The strange thing is, that the drive with the grown error is part of the DATA container (/home and /data), whilst the disk with the rest ( / ) was fine. You'd expect the error to show up in /var/log/messages, but it didn't. I think the entire controller gave up as soon as the error popped up. - Is there a way of having the controller detect / handle grown errors? Will setting automatic remapping handle this? Does Anyone know how to read / write mode pages? Thanks all! Kit -Oorspronkelijk bericht- Van: Salyzyn, Mark [mailto:[EMAIL PROTECTED] Verzonden: maandag 31 januari 2005 17:03 Aan: Kit Gerrits Onderwerp: RE: Disk errors You get tones of I/O error messages from the filesystem driver once the device goes offline. You can check /var/log/messages to find the root cause. 
You will need to run the RAID management tools (afacli) to display the underlying components (container list). Dell has its own customized tools for this; I cannot comment on their usage.

Sincerely -- Mark Salyzyn

- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
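For readers trying to place the "Dev 08:05" in the error quoted above: a minimal sketch of how such a pair can be decoded, assuming it is the block device's major:minor number printed in hex, that major 8 is the first SCSI disk major, and that each disk owns 16 minors (minor 0 for the whole disk, 1-15 for partitions). The function name `decode_sd_dev` is made up for this illustration; it is not a kernel or library call.

```python
import string

SCSI_DISK_MAJOR = 8    # assumption: first SCSI disk major on Linux
MINORS_PER_DISK = 16   # assumption: minor 0 = whole disk, 1-15 = partitions


def decode_sd_dev(dev):
    """Turn a hex 'major:minor' pair such as '08:05' into a /dev/sdXN name."""
    major_text, minor_text = dev.split(":")
    major, minor = int(major_text, 16), int(minor_text, 16)
    if major != SCSI_DISK_MAJOR:
        raise ValueError("major %d is not handled by this sketch" % major)
    # The quotient selects the disk (sda, sdb, ...), the remainder the partition.
    disk, partition = divmod(minor, MINORS_PER_DISK)
    name = "/dev/sd" + string.ascii_lowercase[disk]
    return name if partition == 0 else name + str(partition)


print(decode_sd_dev("08:05"))
```

Under these assumptions the pair from the error message resolves to the fifth partition of the first SCSI disk. Checking the numbers with `ls -l` in /dev on the affected machine is still the authoritative way to confirm the mapping.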
Re: Disk errors
On Tue, Feb 01, 2005 at 12:41:13AM +0100, Kit Gerrits wrote:
> But if the PERC (controller) handles disk errors, what could cause:
>
> I/O Error Dev 08:05 Sector 529712
>
> I would assume that this error is generated by the hard drive, but shouldn't the controller catch SCSI errors (and relocate sectors automagically)?

In this case, the RAID controller is reporting the I/O error. It may be that you've got bad sectors on more than one physical disk, in the same stripe, and the RAID controller can't fix them.

Thanks,
Matt

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com
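To illustrate why bad sectors on two disks in the same stripe cannot be repaired, here is a minimal sketch of single-parity (RAID-5 style) reconstruction. It is a generic XOR-parity model, not the PERC firmware's actual algorithm; the helper names `xor_blocks` and `reconstruct` and the sample data are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe):
    """Rebuild one unreadable chunk (marked None) from the surviving chunks.

    `stripe` holds the data chunks plus the parity chunk. With one parity
    chunk per stripe, at most one missing chunk can be recovered.
    """
    missing = [i for i, chunk in enumerate(stripe) if chunk is None]
    if len(missing) > 1:
        raise IOError("%d chunks unreadable in one stripe" % len(missing))
    rebuilt = list(stripe)
    if missing:
        survivors = [chunk for chunk in stripe if chunk is not None]
        rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


d0, d1 = b"home", b"data"
parity = xor_blocks([d0, d1])

# One unreadable chunk: the XOR of the survivors gives it back.
print(reconstruct([d0, None, parity])[1])

# Two unreadable chunks in the same stripe: nothing left to rebuild from.
try:
    reconstruct([None, None, parity])
except IOError as err:
    print("unrecoverable:", err)
```

In this model the only thing the controller can do in the second case is pass an I/O error up to the host, which matches the behaviour described above.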
RE: Disk errors
Maybe you have a failed disk, and another has bad blocks. So, no good copy of the data exists. Just a guess!!!

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kit Gerrits
Sent: Monday, January 31, 2005 6:41 PM
To: 'Salyzyn, Mark'
Cc: linux-scsi@vger.kernel.org
Subject: RE: Disk errors

But if the PERC (controller) handles disk errors, what could cause:

I/O Error Dev 08:05 Sector 529712

I would assume that this error is generated by the hard drive, but shouldn't the controller catch SCSI errors (and relocate sectors automagically)?

Kit