Exec summary: will the RAID1 code cause sectors with read errors to be re-vectored ?


Full gory details:

I am having problems with some new disks, in that sectors are going bad.
I kind of didn't expect this on `new' disks, but maybe I'm just not used to 
having so many sectors !

So far, the problems have always been in `unused' areas of disk, so every once 
in a while I read all my disks, and process the errors I spot.
I have a perl script which seeks to the corresponding sector on the other
RAID1 disk, reads the data it has, then seeks to the sector on the suspect
disk, tries a read, and if that fails, writes the data from the other disk.
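
For illustration, the procedure above might look roughly like this in Python
(this is a sketch, not the perl script itself). The function name
repair_sector, the 512-byte sector size, and the assumption that the same
sector number addresses the same data on both halves of the mirror are all
mine.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors


def repair_sector(good, suspect, sector, sector_size=SECTOR_SIZE):
    """Rewrite one sector on `suspect` from `good` if it cannot be read.

    `good` and `suspect` are seekable binary file objects, opened
    unbuffered on the two halves of the mirror (hypothetical setup).
    Returns "ok" if the suspect sector read cleanly, "rewritten" otherwise.
    """
    offset = sector * sector_size

    # Fetch the known-good copy first, so nothing is written without it.
    good.seek(offset)
    data = good.read(sector_size)
    if len(data) != sector_size:
        raise OSError("short read from the good disk")

    # Try the suspect sector; only a failed read triggers the rewrite.
    suspect.seek(offset)
    try:
        suspect.read(sector_size)
        return "ok"
    except OSError:
        suspect.seek(offset)
        suspect.write(data)
        return "rewritten"
```

On real devices the files would be opened with something like
open(path, "r+b", buffering=0), and only while the array is not in use,
since nothing here coordinates with the md driver.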

If a read error like this were to happen on a live FS, which of these would
be the result ?

1) the RAID1 code would spot the read error, read the data from the other disk,
   and write the data back. It would then re-read it to check all was well, and
   then report that there had been a read error which was fixed by re-writing.

2) the RAID1 code would spot the read error, remove that partition from the
   RAID1 array, and go into degraded mode.

3) the RAID1 code would not cope, the md device would report a failure,
   and the whole system would grind to a halt.


Also: is there any way of getting the actual *DISK* to report which blocks it
      has revectored ?


The report was:
        scsi0: MEDIUM ERROR on channel 0, id 5, lun 0, CDB:
                Read (10) 00 00 7d a6 c0 00 00 80 00 
        Current error sd08:50: sense key Medium Error 
        Additional sense indicates Unrecovered read error 
        scsidisk I/O error: dev 08:50, sector 8234718
I note that "7d a6 c0" maps to (nearly) 8234718, but what are the
"00 00 80 00" bits ?
Is "(10)" "Read" ?
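
One possible reading of those bytes, assuming the kernel prints the opcode
name in place of the first CDB byte and that the remaining nine bytes follow
the standard 10-byte READ layout (flags, 4-byte block address, one
reserved/group byte, 2-byte transfer length, control byte). The helper name
decode_read10 is mine.

```python
def decode_read10(cdb_tail):
    """Decode the nine bytes printed after the opcode of a 10-byte READ.

    Assumed layout of the tail: flags, LBA (4 bytes, big-endian),
    reserved/group, transfer length (2 bytes, big-endian), control.
    Returns (starting block, number of blocks).
    """
    raw = bytes.fromhex(cdb_tail.replace(" ", ""))
    if len(raw) != 9:
        raise ValueError("expected nine bytes after the opcode")
    lba = int.from_bytes(raw[1:5], "big")
    count = int.from_bytes(raw[6:8], "big")
    return lba, count


lba, count = decode_read10("00 00 7d a6 c0 00 00 80 00")
print(lba, count)                    # 8234688 128
print(lba <= 8234718 < lba + count)  # True
```

Under that reading the request was for 128 blocks starting at 8234688, and
the failing sector 8234718 is the 31st block of the request, which would
explain the "nearly".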


Another disk works fine for a while, then reports:
        SCSI disk error : host 1 channel 0 id 3 lun 0 return code = 28000002
        [valid=0] Info fld=0xbe7e, Current sd08:30: sense key Hardware Error
        Additional sense indicates Mechanical positioning error
        scsidisk I/O error: dev 08:30, sector 48766
until I power cycle it, at which point it works just fine.
Is this a common failure mode ?
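
For cross-checking the numbers in the two reports, here is a small sketch.
It assumes "dev 08:30" is major:minor in hex, that major 8 is the SCSI disk
driver with 16 minors per disk, and that the return code packs driver, host,
message and status bytes from high to low. Both helper names are mine.

```python
def decode_sd_dev(dev):
    """Map a hex 'major:minor' pair such as '08:30' to an sd device name."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    if major != 8:
        raise ValueError("not SCSI disk major 8")
    disk, partition = divmod(minor, 16)  # assumption: 16 minors per disk
    name = "sd" + chr(ord("a") + disk)
    return name if partition == 0 else name + str(partition)


def split_return_code(code):
    """Split the 32-bit return code into four bytes (assumed layout)."""
    return {
        "driver": (code >> 24) & 0xFF,
        "host": (code >> 16) & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "status": code & 0xFF,
    }


print(decode_sd_dev("08:30"), decode_sd_dev("08:50"))  # sdd sdf
print(int("0xbe7e", 16))                               # 48766
print(split_return_code(0x28000002))
```

So the "Info fld" value 0xbe7e is the same number as the reported sector
48766, and a status byte of 0x02 is the SCSI CHECK CONDITION status.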
