Repeated errors in RAID5 set.

Max TenEyck Woodbury Thu, 08 Mar 2001 11:33:50 -0800
I brought this up on the raid list some time ago and got a less
than completely helpful response. I concluded that more information
was needed before I asked the question again.

Problem:

I have an Alpha running Red Hat Linux 6.2 (kernel 2.2.14) with
two SCSI adapters, an AHA-294X and a sym53c895. The trouble is
associated with the sym53c895. On its LVD bus, there are 4 disks:

Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: QUANTUM  Model: ATLAS 10K 36WLS  Rev: UCP0
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 01 Lun: 00
  Vendor: QUANTUM  Model: ATLAS 10K 36WLS  Rev: UCP0
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 02 Lun: 00
  Vendor: QUANTUM  Model: ATLAS 10K 36WLS  Rev: UCP0
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi1 Channel: 00 Id: 03 Lun: 00
  Vendor: SEAGATE  Model: ST336704LW       Rev: 0004
  Type:   Direct-Access                    ANSI SCSI revision: 03

Each has one 36 GB partition, accessible as /dev/sdc1 through
/dev/sdf1. The first three have been configured with RAID5 into a
72 GB device, /dev/md0, which holds an ext2 file system. At odd
intervals, but always shortly after 04:03:35 in the morning, an
error occurs on sector 71434352 of /dev/sdc1. (See the log
extracts later in this text.) /dev/sdc1 is then kicked out of the
RAID5 set until I come in and use raidhotremove/raidhotadd to put
it back in. The reinsertion always succeeds without error.

This brings up two questions. The more important one is:

Why is the device being kicked out of the RAID set (other than
the obvious answer that that is the way the code is written)
without any real attempt at error recovery? At the least, the
read should be retried once, and that does not seem to be happening.
Further, since this is a RAID5 set, the sector can be recovered
from the other members of the set and rewritten on the original
disk. (This is what happens during the rebuild after raidhotadd, and
the indications are that it always succeeds.) It is NOT happening
as part of the normal handling of a read error. (There was another
message on the RAID list some time ago indicating that
writes were not retried either and that they should be.) I
can see that some kinds of error require that a member be removed
immediately from the RAID set, but in my opinion this is not that
kind of error.
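
To make the recovery argument concrete: in RAID5 the parity chunk of
a stripe is the XOR of the data chunks, so any one unreadable chunk
equals the XOR of the chunks that can still be read. The sketch below
is Python with made-up 8-byte chunks and a hypothetical helper name
(xor_blocks); it illustrates the arithmetic only and is not the
kernel's raid5 code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe of a three-member set: two data chunks and one parity chunk.
# The chunk contents are invented for illustration.
data_a = b"\x10\x20\x30\x40\x50\x60\x70\x80"
data_b = bytes(range(8))
parity = xor_blocks([data_a, data_b])

# Suppose the member holding data_a returns a medium error on read.
# The chunk can be rebuilt from the two surviving members:
recovered = xor_blocks([data_b, parity])
print(recovered == data_a)
```

The rebuilt chunk could then be written back to the failing member,
which is the step I would expect before the member is dropped.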

The less important question is:

Why is this particular pattern of errors occurring? It is odd in
at least two respects: it happens at the same time of day and it
always hits the same block. Real disk errors do not usually happen on
such a regular schedule, and they tend to spread to more and more
blocks over time. Also, as mentioned above, the block in question
is being rewritten regularly as part of the RAID set reconstruction.
If it were a real error, the drive would have reassigned the block
and the error would either not recur or would move around. Since
it is not being reassigned, the drive must not see it as a real
error. So, does anybody have a suggestion about what is really
going on?
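
To put a number on "the same clock time", here is a small Python
sketch that takes the times of day of the thirteen raid5 failure
messages in the log extracts and measures how tightly they cluster.
The list is copied by hand from those extracts; the variable and
function names are only illustrative.

```python
from datetime import datetime

# Times of day of the "Disk failure on sdc1" messages, Feb 9 through Mar 8.
failure_times = [
    "04:03:39", "04:03:42", "04:03:39", "04:03:40", "04:03:38",
    "04:03:37", "04:03:37", "04:03:38", "04:03:36", "04:03:36",
    "04:03:38", "04:03:37", "04:03:37",
]

def seconds_since_midnight(hms):
    """Convert an HH:MM:SS string to seconds since midnight."""
    t = datetime.strptime(hms, "%H:%M:%S")
    return t.hour * 3600 + t.minute * 60 + t.second

secs = [seconds_since_midnight(t) for t in failure_times]
earliest = min(failure_times)   # fixed-width HH:MM:SS sorts correctly as text
latest = max(failure_times)
spread = max(secs) - min(secs)  # width of the window, in seconds

print(len(failure_times), "failures between", earliest, "and", latest)
print("spread:", spread, "seconds")
```

This only shows how narrow the window is; it says nothing about what
issues the read at that time.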

Feb  9 04:03:39 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb  9 04:03:39 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb  9 04:03:39 oscar kernel: Additional sense indicates Unrecovered read error
Feb  9 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb  9 04:03:39 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 15 04:03:42 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 15 04:03:42 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 15 04:03:42 oscar kernel: Additional sense indicates Unrecovered read error
Feb 15 04:03:42 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 15 04:03:42 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 16 04:03:39 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 16 04:03:39 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 16 04:03:39 oscar kernel: Additional sense indicates Unrecovered read error
Feb 16 04:03:39 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 16 04:03:39 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 18 04:03:40 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 18 04:03:40 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 18 04:03:40 oscar kernel: Additional sense indicates Unrecovered read error
Feb 18 04:03:40 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 18 04:03:40 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 20 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 20 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 20 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
Feb 20 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 20 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 22 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 22 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 22 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Feb 22 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 22 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Feb 23 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Feb 23 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Feb 23 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Feb 23 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Feb 23 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar  1 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar  1 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar  1 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
Mar  1 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar  1 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar  3 04:03:36 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar  3 04:03:36 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar  3 04:03:36 oscar kernel: Additional sense indicates Unrecovered read error
Mar  3 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar  3 04:03:36 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar  5 04:03:36 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar  5 04:03:36 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar  5 04:03:36 oscar kernel: Additional sense indicates Unrecovered read error
Mar  5 04:03:36 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar  5 04:03:36 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar  6 04:03:38 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar  6 04:03:38 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar  6 04:03:38 oscar kernel: Additional sense indicates Unrecovered read error
Mar  6 04:03:38 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar  6 04:03:38 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar  7 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar  7 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar  7 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Mar  7 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar  7 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
--
Mar  8 04:03:37 oscar kernel: scsi1: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 04 42 00 90 00 00 08 00
Mar  8 04:03:37 oscar kernel: Info fld=0x4420097, Current sd08:21: sense key Medium Error
Mar  8 04:03:37 oscar kernel: Additional sense indicates Unrecovered read error
Mar  8 04:03:37 oscar kernel: scsidisk I/O error: dev 08:21, sector 71434352
Mar  8 04:03:37 oscar kernel: raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
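
For anyone who wants to check the numbers in these messages, the
Python sketch below decodes the Read (10) command block, the sense
Info field, and the "dev 08:21" device number. Two things are my
assumptions rather than facts from the log: the opcode byte 0x28
(the kernel prints it as the name "Read (10)", so I prepend it), and
that the "sector" in the scsidisk message is the start of the request
relative to the partition. The variable names are illustrative.

```python
import struct

# Read (10) CDB: opcode, flags, 4-byte LBA, group, 2-byte length, control.
# 0x28 is prepended by hand; the other nine bytes are from the log.
cdb = bytes.fromhex("28 00 04 42 00 90 00 00 08 00")
opcode, flags, lba, group, length, control = struct.unpack(">BBIBHB", cdb)

info_lba = 0x4420097            # "Info fld" from the sense data
offset_in_request = info_lba - lba

# dev 08:21 is major 8 (sd), minor 0x21; sd uses 16 minors per disk.
major, minor = 0x08, 0x21
disk_name = "sd" + chr(ord("a") + minor // 16) + str(minor % 16)

# Assumption: the reported sector is relative to the start of the partition.
reported_sector = 71434352
implied_partition_start = lba - reported_sector

print("request: LBA", lba, "length", length, "blocks")
print("failing block: LBA", info_lba, "offset", offset_in_request)
print("device:", disk_name)
print("implied partition start:", implied_partition_start)
```

Under those assumptions the failing block is the last block of the
request, and it is the same absolute block on the disk every time.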