Q1) Is a `Medium Error' almost certainly an HDA HW problem ?
Q2) Should SCSI errors go to /var/log/messages as well as dmesg ?
Q3) Could RAID problems be reported by email ?
Q4) How `should' iffy disks be replaced ?


Running RedHat 6.1/2.2.12-20smp on a Gigabyte GA-6BXDU (Adaptec 7896),
to access a Trimm UR24JB(Rack) `24 Drive Cube' (GEM312V2) JBoD crate
filled with SEAGATE ST118202LC disks, using raidtools 0.90 for RAID1.


Far more frequently than I would expect, disks appear to be `failing'.

Q1) Is a `Medium Error' almost certainly an HDA HW problem ?

Not totally on topic, but I guess people on this list know about SCSI ...

Not running the `expected' OS, I want to be extra sure that the problem
*IS* with the HDA, and not a Linux or cabling problem, before sending it
back to our suppliers for a replacement. As I see it,

        MEDIUM ERROR CDB: Read (10) 00 00 40 48 86 00 00 80 00;
        Info fld=0x4048fd, Current sd08:24: sense key Medium Error;
        Additional sense indicates Unrecovered read error;
        scsidisk I/O error: dev 08:24, sector 3952

suggests that it is a HW problem on the HDA itself.
Is this a safe assumption ?


Q2) Should SCSI errors go to /var/log/messages as well as dmesg ?

I can see the message using `dmesg' shortly after the incident,
but it appears not to be in /var/log/messages.
As I remember, when running my own kernel (rather than the RH kernel)
it went to /var/log/messages -- is this a kernel difference, or a
configuration problem with syslog etc ?
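
For reference, the sort of /etc/syslog.conf line I would expect to
matter (the exact entry is an assumption on my part -- and klogd has
to be running at all for kernel messages to reach syslogd):

        # /etc/syslog.conf -- catch-all that covers kern.*; if this
        # (or klogd) is missing, SCSI errors only ever appear in the
        # kernel ring buffer, i.e. via dmesg:
        *.info;mail.none;authpriv.none          /var/log/messages
        # reload syslogd after editing:
        #   kill -HUP `cat /var/run/syslogd.pid`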


Getting on to linux-raid stuff ....

Q3) Could RAID problems be reported by email ?

I collect various config and status info from my systems and visually
inspect the diffs, which is not the best way to spot a disk failure.
How `should' a disk failure be notified ?
Could the kernel arrange (one way or another) that email is generated
whenever a RAID disk failure is noticed ?
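
In the absence of anything built in, the best I can come up with is a
cron job along these lines (a sketch only -- it relies on raidtools
0.90 marking failed members `(F)' in /proc/mdstat, and a real version
would want to remember what it has already reported):

        #!/bin/sh
        # Hypothetical cron job (run every few minutes): mail root
        # whenever /proc/mdstat shows a failed member, which
        # raidtools 0.90 marks with "(F)".
        if grep -q '(F)' /proc/mdstat; then
                mail -s "RAID failure on `hostname`" root < /proc/mdstat
        fi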


Q4) How `should' iffy disks be replaced ?

The normal scenario is for a single Medium Error or Positioning Error
to be followed by a slew of them.
Power cycling the disk (e.g. removing it from the crate and re-inserting it) fixes it
(even if Linux does not `see' that this has happened).
However, not long after, the problem will recur.
As such, I want to replace the disk.

Unfortunately, `raidhotremove' does not work on a disk which is in use.
[[ This was discussed some time ago -- I never really understood it though ]]
The best I could come up with was:
        Remove disk
        Create new files in all FSs that use that disk (forcing writes,
          so that md hits errors and marks the partitions `(F)')
        Run `sync'
        Do lots of `raid hot remove's now that the partitions are `(F)'
        echo "scsi remove-single-device 0 0 2 0" > /proc/scsi/scsi 
        Insert replacement disk
        echo "scsi add-single-device 0 0 2 0" > /proc/scsi/scsi 
        Do lots of `raid hot add's
which seems somewhat more risky than I'd like, and required me to write some
perl scripts to do all the `raid hot *'s (a shell sketch of that loop follows).
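
For the record, this is roughly the loop those scripts amount to (a
sketch, assuming /etc/raidtab is the right place to map partitions to
md devices; `raidhotall' and its usage are my own invention):

        #!/bin/sh
        # raidhotall (hypothetical): run raidhotremove or raidhotadd
        # on every partition of one disk, looking up the md device
        # for each partition in /etc/raidtab.
        # Usage: raidhotall raidhotremove sdc
        ACTION=$1; DISK=$2
        for PART in /dev/${DISK}[0-9]*; do
                MD=`awk -v p=$PART '$1 == "raiddev" {md = $2}
                     $1 == "device" && $2 == p {print md; exit}' /etc/raidtab`
                [ -n "$MD" ] && $ACTION "$MD" "$PART"
        done
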
The PSB (persistent superblock) has plenty of space to hold `old' info for
`no longer available' disks.
Why can't it retain information about what the old disks *used* to be, so that
after a disk failure, I can slap in a new disk and tell it `go use what you had
before' ?
As it is, if I ever reboot (even into single user mode) with a disk (or more
commonly a SCSI bus) unavailable for some reason, it shoves all the redundant
RAID partitions into degraded mode, and I have to do lots of `raid hot add's
(the same loop again, with raidhotadd) to put it all back as it was.
Sure, it may decide to re-sync the `good' disk to the `suspect' ones,
but it should `recognise' them, and re-incorporate them automatically.
