Recovering from a lost disk

Marc Haber Sat, 23 Oct 1999 11:32:24 -0700
Hi!

I have almost finished my RAID experiments and am about to start
using the RAID patches with RAID level 5 in earnest. Today, I was
testing recovery procedures for disks going bad.

My test system has three Quantum Atlas I disks (2 GB each: sda, ID=1;
sdb, ID=2; and sdc, ID=3) on a single Fast SCSI host adapter with an
NCR chipset. The fourth disk is a 4 GB Seagate drive (sdd, ID=4) which
holds the system itself and a 2 GB partition that serves as a spare
disk for the RAID array. sdb and sdc have switches in their power
supply leads to simulate their failure. My /etc/raidtab is included at
the end of this article.

First thing was switching sdb off during RAID usage. The system
hesitated for a few seconds and threw about ten screenfuls of messages
and then continued working. The spare disk was pulled in and the array
was reconstructed. After the reconstruction finished, I took the
system down and simulated replacement of the "faulty" sdb disk by
restoring its power. The system, however, continued to run on the two
untouched disks and the spare disk. I have not been able to get the
system to go back to using the "replaced" disk. Additionally, as a
result of that operation, my /etc/raidtab got out of sync with the
persistent superblock.
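
For readers unfamiliar with what "reconstructed" means for RAID level
5, here is a minimal sketch of the XOR parity arithmetic that makes a
rebuild onto the spare possible. It is an illustration only, not the
kernel's code: the function name xor_blocks and the four-byte chunks
are made up for this example, and a real array works on much larger
chunks and rotates the parity between disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One hypothetical stripe on a three-disk array: two data chunks plus parity.
chunk_a = b"RAID"
chunk_b = b"test"
parity = xor_blocks([chunk_a, chunk_b])

# Pretend the disk holding chunk_b failed: rebuild it from the survivors.
rebuilt = xor_blocks([chunk_a, parity])
```

Because XOR is its own inverse, any single missing chunk in a stripe
can be recomputed from the remaining ones, which is why the array
survives exactly one failed disk at a time.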

I then proceeded to switch off sdc. This time, the system became
unusable. It still responded to pings over the network, but my shell
became unresponsive, I couldn't log in any more and new telnet
connections were eventually refused. The system console showed
zillions of SCSI resets and read errors on ID 3 (which happened to be
the "failed" sdc). That state persisted for about two hours before I
cut power. I'd have expected the kernel to notice that sdc is dead and
continue operation on the remaining disks.

This time, I left sdc off when the system rebooted. It came up all
right and ran the array in degraded mode on two disks. At least, the data
was still available. After an orderly shutdown, I "replaced" sdc by
restoring its power.

After booting again, the RAID was still in degraded mode. At that
point, I had all four disks available with the RAID running on only
two of them. I would have expected the system to notice the "replaced"
disks and to pull them back into the array, so that it runs with
redundancy and a spare again.
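
A degraded array like the one described above can be spotted from a
script. The following is a rough sketch that assumes the array status
text (as shown in /proc/mdstat) contains a marker of the form
"[3/2] [UU_]", meaning configured/active disks followed by one flag
per slot. The function name array_health and the sample line are
invented for illustration, and the exact status format may differ
between versions of the RAID patches.

```python
import re


def array_health(status_line):
    """Return (configured, active, missing_slots) from a '[n/m] [UU_]' marker."""
    match = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", status_line)
    if match is None:
        raise ValueError("no '[n/m] [UU_]' marker found")
    configured = int(match.group(1))
    active = int(match.group(2))
    # An underscore marks a slot with no working disk in it.
    missing = [slot for slot, flag in enumerate(match.group(3)) if flag == "_"]
    return configured, active, missing


# Made-up sample line; the block count and layout are assumptions.
sample = "4192768 blocks level 5, 32k chunk, algorithm 2 [3/2] [UU_]"
configured, active, missing = array_health(sample)
degraded = active < configured
```

With this sample, the array counts as degraded and slot 2 is reported
as missing.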

Now my questions:

(1)
How do I tell the system that a failed disk has been replaced? If a
spare disk has already been put into use, can I instruct the system to
go back to the original disk and free the spare again? How do I have
the system reconstruct the array on a replaced disk? Ideally, I'd have
expected the system to notice that the failed disk is back and
partitioned as expected, to reconstruct on the "new" disk and to free
the spare disk (hey, some people are really stingy and use slower
disks from the spare parts rack as spare disks in RAID installations).

(2)
What happened on the second disk "failure" when the system became
unusable? Isn't it RAID's purpose to keep such things from happening?
I'd have expected the kernel to notice that the disk is dead for good
and to stop trying to access it over and over. It worked the first
time!

(3)
How do I recover from an /etc/raidtab that has gotten out of sync with
the persistent superblock?

(4)
Is there a way to get the system to create a /etc/raidtab that is in
sync with what is in the persistent superblock?

(5)
During my tests, I found that the RAID patches are extremely verbose
on the system console even when the commands are given over a telnet
connection. klogd is running, and I find it OK to have the messages
going into syslog, but why are they being reported to the console too?

Any hints will be appreciated, thanks in advance.

Greetings
Marc




/etc/raidtab:
raiddev                 /dev/md0
raid-level              5
nr-raid-disks           3
nr-spare-disks          1
persistent-superblock   1
chunk-size              32

parity-algorithm        left-symmetric

device                  /dev/sda5
raid-disk               0

device                  /dev/sdb5
raid-disk               1

device                  /dev/sdc6
raid-disk               2

device                  /dev/sdd6
spare-disk              0
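
When comparing a raidtab like the one above against what the array
actually uses, it can help to read the file into a structure first.
This is a minimal sketch under a few assumptions: every line is a
keyword followed by a value, "#" starts a comment, and each raid-disk
or spare-disk line refers to the device line before it. The function
name parse_raidtab and the dictionary layout are made up for this
example; it does not read the persistent superblock.

```python
def parse_raidtab(text):
    """Parse raidtab-style text into a list of array descriptions."""
    arrays = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumed syntax)
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("expected 'keyword value': %r" % raw)
        key, value = parts[0], parts[1].strip()
        if key == "raiddev":
            current = {"raiddev": value, "options": {}, "devices": []}
            arrays.append(current)
        elif current is None:
            raise ValueError("%r appears before any raiddev line" % key)
        elif key == "device":
            current["devices"].append({"device": value})
        elif key in ("raid-disk", "spare-disk"):
            if not current["devices"]:
                raise ValueError("%r appears before any device line" % key)
            # The role applies to the most recently listed device.
            current["devices"][-1].update(role=key, index=int(value))
        else:
            current["options"][key] = value
    return arrays


RAIDTAB = """
raiddev                 /dev/md0
raid-level              5
nr-raid-disks           3
nr-spare-disks          1
persistent-superblock   1
chunk-size              32
parity-algorithm        left-symmetric
device                  /dev/sda5
raid-disk               0
device                  /dev/sdb5
raid-disk               1
device                  /dev/sdc6
raid-disk               2
device                  /dev/sdd6
spare-disk              0
"""

arrays = parse_raidtab(RAIDTAB)
spares = [d["device"] for d in arrays[0]["devices"] if d["role"] == "spare-disk"]
```

Here spares lists the partitions the file declares as spares, which
can then be checked by hand against the disks the running array
reports.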

-- 
-------------------------------------- !! No courtesy copies, please !! -----
Marc Haber          |   " Questions are the         | Mailadresse im Header
Karlsruhe, Germany  |     Beginning of Wisdom "     | Fon: *49 721 966 32 15
Nordisch by Nature  | Lt. Worf, TNG "Rightful Heir" | Fax: *49 721 966 31 29