Q1) Is a `Medium Error' almost certainly a HDA HW problem ?
Q2) Should SCSI errors go to /var/log/messages as well as dmesg ?
Q3) Could RAID problems be reported by email ?
Q4) How `should' iffy disks be replaced ?

Running RedHat 6.1/2.2.12-20smp on a Gigabyte GA-6BXDU (Adaptec 7896), to access a Trimm UR24JB(Rack) `24 Drive Cube' (GEM312V2) JBoD crate filled with SEAGATE ST118202LC discs, using raidtools 0.90 for RAID1. Far more frequently than I would expect, disks appear to be `failing'.

Q1) Is a `Medium Error' almost certainly a HDA HW problem ?

Not totally on topic, but I guess people on this list know about SCSI ... Not running the `expected' OS, I want to be extra sure that the problem *IS* with the HDA before sending it back to our suppliers to get a replacement, and not a Linux or cabling problem. As I see it,

  MEDIUM ERROR CDB: Read (10) 00 0 0 40 48 86 00 00 80 00; Info fld=0x4048fd,
  Current sd08:24: sense key Medium Error; Additional sense indicates
  Unrecovered read error; scsidisk I/O error: dev 08:24, sector 3952

suggests that it is a HW problem on the HDA itself. Is this a safe assumption ?

Q2) Should SCSI errors go to /var/log/messages as well as dmesg ?

I can see the message using `dmesg' shortly after the incident, but it appears not to be in /var/log/messages. As I remember, when running my own kernel (rather than the RH kernel) it went to /var/log/messages -- is this a kernel difference, or a configuration problem with syslog etc ?

Getting on to linux-raid stuff ....

Q3) Could RAID problems be reported by email ?

I collect various config and status info from my systems, and visually inspect the diffs, which is not the best way to spot a disk failure. How `should' a disk failure be notified ? Could the kernel arrange (one way or another) for email to be generated whenever a RAID disk failure is noticed ?

Q4) How `should' iffy disks be replaced ?

The normal scenario is for a single Medium Error or Positioning Error to be followed by a slew of them.
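[ On the Q3 point above: pending any kernel-level notification, one stop-gap I have seen suggested is a cron job that mails a diff whenever /proc/mdstat changes. A minimal sketch, wrapped as a function so nothing runs on load -- the state-file path and mail command in the example call are placeholders, not anything raidtools provides: ]

```shell
#!/bin/sh
# Sketch: notify by mail when the md status file changes.
# Intended to be run from cron every few minutes.
# usage: check_mdstat <status-file> <state-file> <mail-command...>
check_mdstat() {
    mdstat=$1; state=$2; shift 2
    touch "$state"                     # first run: empty baseline
    if ! cmp -s "$mdstat" "$state"; then
        diff "$state" "$mdstat" | "$@" # mail the change
        cat "$mdstat" > "$state"       # remember current status
    fi
}

# On a real system, something like (paths/address are examples):
#   check_mdstat /proc/mdstat /var/run/mdstat.last \
#       mail -s "RAID status change" root
```

[ Crude, but it turns the `visually inspect the diffs' step into something automatic. ]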
Power cycling (e.g. removing from the crate and re-inserting) fixes it (even if Linux does not `see' that this has happened). However, not long after, the problem will recur. As such, I want to replace the disk. Unfortunately, `raidhotremove' does not work on a disk which is in use. [[ This was discussed some time ago -- I never really understood it though ]] The best I could come up with was:

  Remove disk
  Create new files in all FSs that use that disk
  Run `sync'
  Do lots of `raidhotremove's now that the partitions are `(F)'
  echo "scsi remove-single-device 0 0 2 0" > /proc/scsi/scsi
  Insert replacement disk
  echo "scsi add-single-device 0 0 2 0" > /proc/scsi/scsi
  Do lots of `raidhotadd's

which seems somewhat more risky than I'd like, and required me to write some perl scripts to do all the `raidhot*'s.

The PSB has plenty of space to hold `old' info for `no longer available' disks. Why can't it retain information about what the old disks *used* to be, so that after a disk failure, I can slap in a new disk and tell it `go use what you had before' ? As it is, if I ever reboot (even into single user mode) with a disk (or more commonly a SCSI bus) unavailable for some reason, it shoves all the redundant RAID partitions into degraded mode, and I have to do lots of `raidhotadd's to put it all back as it was. Sure, it may decide to re-sync the `good' disk to the `suspect' ones, but it should `recognise' them, and re-incorporate them automatically.
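[ For what it's worth, the whole Q4 sequence above can be wrapped in one shell function instead of separate perl scripts. A sketch only -- nothing runs on load, the md/partition pairs in the example call are made up, and SCSI_PROC is parameterised purely so the sketch can be dry-run (on a real box it is /proc/scsi/scsi): ]

```shell
#!/bin/sh
# Sketch of the hot-swap sequence for one failing disk.
SCSI_PROC=${SCSI_PROC:-/proc/scsi/scsi}

# usage: swap_disk "host chan id lun" md:partition [md:partition ...]
swap_disk() {
    addr=$1; shift
    for p in "$@"; do                         # detach each (F) mirror half
        raidhotremove "/dev/${p%%:*}" "${p#*:}"
    done
    sync                                      # flush before pulling the disk
    echo "scsi remove-single-device $addr" > "$SCSI_PROC"
    printf 'Swap the drive, then press return: ' >&2; read dummy
    echo "scsi add-single-device $addr" > "$SCSI_PROC"
    for p in "$@"; do                         # re-add; kernel resyncs
        raidhotadd "/dev/${p%%:*}" "${p#*:}"
    done
}

# Example call, for a disk at host 0, channel 0, id 2, lun 0 whose
# partitions sit in md0 and md1 (layout is hypothetical):
#   swap_disk "0 0 2 0" md0:/dev/sdc1 md1:/dev/sdc2
```

[ It still depends on the partitions already being marked `(F)' before the raidhotremove's, which is the awkward part of the procedure in the first place. ]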