I brought this up on the raid list some time ago and got a less
than completely helpful response. I concluded that more information
was needed before I asked the question again.
Problem:
I have an Alpha running Red Hat Linux 6.2 (kernel 2.2.14) with
two SCSI adapters, an AHA-294X and a sym53c895. The trouble is
associated with the sym53c895. On its LVD bus, there are 4 disks:

Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: QUANTUM Model: ATLAS 10K 36WLS Rev: UCP0
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 01 Lun: 00
  Vendor: QUANTUM Model: ATLAS 10K 36WLS Rev: UCP0
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 02 Lun: 00
  Vendor: QUANTUM Model: ATLAS 10K 36WLS Rev: UCP0
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE Model: ST336704LW Rev: 0004
  Type: Direct-Access ANSI SCSI revision: 03

Each has one 36 GB partition, accessible as /dev/sdc1 through
/dev/sdf1. The first three have been configured with RAID5 into a
72 GB device, /dev/md0, and formatted with an ext2 file system. At
odd intervals, but always shortly after 04:03:35 in the morning, an
error occurs on sector 71434352 of the disk /dev/sdc1. (See log
extracts later in this text.) /dev/sdc1 is then kicked out of the
RAID5 set until I come in and raidhotremove/raidhotadd it back in.
The reinsertion always succeeds without error.
This brings up two questions. The more important one is:
Why is the device being kicked out of the RAID set (other than
the obvious answer that that is the way the code is written)
without any real attempt at error recovery? At the least, the
read should be retried once, and that does not seem to be happening.
Further, since this is a RAID5 set, the sector can be recovered
from the other members of the set and rewritten on the original
disk. (This already happens during the reconstruction that follows
raidhotadd, and the indications are that it always succeeds.) It
is NOT happening as part of the handling of the read error itself.
(There was another message on the RAID list some time ago that
indicated that writes were not retried either and that they should
be.) I can see that some kinds of error require that a member be
removed immediately from the RAID set, but this is not that kind
of error in my opinion.
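
To make the recovery argument concrete, here is a minimal sketch of
how a RAID5 chunk can be rebuilt from the surviving members by XOR.
It illustrates the principle only and is not the kernel's md code;
the function name and the chunk contents are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe on a 3-disk RAID5 set: two data chunks plus parity.
data_a = bytes([0x11, 0x22, 0x33, 0x44])
data_b = bytes([0xA0, 0xB0, 0xC0, 0xD0])
parity = xor_blocks([data_a, data_b])

# Pretend data_a could not be read: rebuild it from the two survivors.
rebuilt = xor_blocks([data_b, parity])
assert rebuilt == data_a
```

The rebuilt chunk could then be written back to the disk that
reported the error, which is the step I would expect before the
member is dropped from the set.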
The less important question is:
Why is this particular pattern of errors occurring? It is odd in
at least two respects: It happens at the same clock time and is
always the same block. Real disk errors do not usually follow
such a regular schedule, and they tend to spread to more and more
blocks over time. Also, as mentioned above, the block in question
is being rewritten regularly as part of the RAID set reconstruction.
If it were a real error, the drive would have reassigned the block
and the error would either not recur, or would move around. Since
it is not being reassigned, the drive must not see it as a real
error. So, does anybody have a suggestion about what is really
going on?
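
For anyone who wants to check the regularity, the following is a
rough sketch that tallies the "scsidisk I/O error" lines from the
log extracts in this message. The regular expression and the
function name are my own, and the sample holds only those lines.

```python
import re
from collections import Counter

SAMPLE = """\
Feb 9 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 15 04:03:42 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 16 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 18 04:03:40 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 20 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 22 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 23 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 1 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 3 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 5 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 6 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 7 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 8 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
"""

# Month, day, clock time, host, then the device and sector of the error.
PATTERN = re.compile(
    r"^(\w{3})\s+(\d+)\s+(\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"scsidisk I/O error: dev ([0-9a-f]{2}:[0-9a-f]{2}), sector (\d+)"
)


def summarize(text):
    """Count errors per (device, sector) and collect the clock times."""
    sectors = Counter()
    times = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            _month, _day, clock, dev, sector = match.groups()
            sectors[(dev, int(sector))] += 1
            times.append(clock)
    return sectors, sorted(times)


sectors, times = summarize(SAMPLE)
print(sectors)              # one (device, sector) pair, seen 13 times
print(times[0], times[-1])  # earliest and latest time of day
```

All 13 events hit the same device and sector, and the times of day
fall between 04:03:36 and 04:03:42.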

Feb 9 04:03:39 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 9 04:03:39 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 9 04:03:39 oscar kernel: Additional sense indicates Unrecovered read error
Feb 9 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 9 04:03:39 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 15 04:03:42 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 15 04:03:42 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 15 04:03:42 oscar kernel: Additional sense indicates Unrecovered read error
Feb 15 04:03:42 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 15 04:03:42 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 16 04:03:39 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 16 04:03:39 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 16 04:03:39 oscar kernel: Additional sense indicates Unrecovered read error
Feb 16 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 16 04:03:39 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 18 04:03:40 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 18 04:03:40 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 18 04:03:40 oscar kernel: Additional sense indicates Unrecovered read error
Feb 18 04:03:40 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 18 04:03:40 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 20 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 20 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 20 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
Feb 20 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 20 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 22 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 22 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 22 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Feb 22 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 22 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 23 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 23 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 23 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Feb 23 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 23 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar 1 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar 1 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar 1 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
Mar 1 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 1 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar 3 04:03:36 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar 3 04:03:36 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar 3 04:03:36 oscar kernel: Additional sense indicates Unrecovered read error
Mar 3 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 3 04:03:36 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar 5 04:03:36 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar 5 04:03:36 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar 5 04:03:36 oscar kernel: Additional sense indicates Unrecovered read error
Mar 5 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 5 04:03:36 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar 6 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar 6 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar 6 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
Mar 6 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 6 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar 7 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar 7 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar 7 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Mar 7 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 7 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar 8 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar 8 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar 8 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Mar 8 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar 8 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
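
The numbers in the log extracts above can be cross-checked by
decoding them. This is only a sketch under stated assumptions: the
standard READ(10) layout (flags, 4-byte big-endian LBA, group,
2-byte transfer length, control), the kernel printing the bytes that
follow the opcode, and the "sector" in the scsidisk line being
relative to the start of the partition. That last point is an
assumption, not something the logs confirm. The helper names are
invented for the example.

```python
def decode_read10(tail_hex):
    """Decode the nine bytes printed after 'Read (10)' in the log."""
    raw = bytes.fromhex(tail_hex)
    if len(raw) != 9:
        raise ValueError("expected the 9 bytes that follow the READ(10) opcode")
    lba = int.from_bytes(raw[1:5], "big")     # starting logical block address
    blocks = int.from_bytes(raw[6:8], "big")  # number of blocks requested
    return lba, blocks


def sd_name(dev):
    """Map a hex 'major:minor' pair such as '08:21' to an sd device name."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk, partition = divmod(minor, 16)       # 16 minors per SCSI disk
    name = "sd" + chr(ord("a") + disk)
    return name + (str(partition) if partition else "")


lba, blocks = decode_read10("00 04 42 00 90 00 00 08 00")
failing_lba = 0x4420097       # "Info fld" from the sense data
reported_sector = 71434352    # from the scsidisk I/O error line

print(lba, blocks)            # start of the request and its length
print(failing_lba - lba)      # position of the bad block in the request
print(lba - reported_sector)  # offset between disk LBA and reported sector
print(sd_name("08:21"))       # device name for major 8, minor 0x21
```

Decoded this way, the request is for 8 blocks starting at LBA
71434384, and the Info field points at the last block of that
request (offset 7). The difference of 32 between the LBA and the
reported sector would be consistent with the partition starting 32
sectors into the disk, if the assumption above holds. Device 08:21
maps to sdc1, matching the raid5 message.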