Hello Jakob,

I recently had a two-disk failure, and found your HOWTO as well as a message
from Martin Bene very helpful in resolving this. Below you'll find a modified
version of chapter 6.1 of your FAQ, in which I merged your version and Martin's
to make things somewhat more detailed and explicit.

The new version should make the recovery procedure clearer for people
experiencing the problem. And in that situation, they are of course grateful
for any help they can get...

The two-disk-failed situation seems to happen relatively often (due to
controller/hardware failures or hiccups), and this is where RAID is effectively
more dangerous than non-RAID (one large disk). A more automated and fool-proof
tool for resolving this might be the ideal solution (but more than I can
deliver currently).

If someone on the mailing list finds some mistake (am I really right about the
spare-disk?) or has an improved version, please post!

========= my proposed HOWTO version:

6.1 Recovery from a multiple disk failure

The scenario is:

* A controller dies and takes two disks offline at the same time.
* All disks on one SCSI bus can no longer be reached if a disk dies.
* A cable comes loose...

In short: quite often you get a temporary failure of several disks at once;
afterwards the RAID superblocks are out of sync and you can no longer init
your RAID array.

One thing is left: rewrite the RAID superblocks with mkraid --force.

To get this to work, you'll need an up-to-date /etc/raidtab - if it doesn't
EXACTLY match the devices and ordering of the original disks, this won't
work.
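
For example, an up-to-date raidtab for a three-disk RAID-5 array might look
like the sketch below. The device names and parameters here are placeholders,
not your actual setup - use the values from when the array was created:

    raiddev /dev/md0
        raid-level              5
        nr-raid-disks           3
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32
        device                  /dev/sda1
        raid-disk               0
        device                  /dev/sdb1
        raid-disk               1
        device                  /dev/sdc1
        raid-disk               2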

Look at the syslog produced by trying to start the array; you'll see the event
count for each superblock. Usually it's best to leave out the disk with the
lowest event count, i.e. the one that failed first (by using "failed-disk").
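
A quick way to pull those lines out of the log (assuming syslog writes to
/var/log/messages, the usual default; the exact message format depends on
your kernel version):

    raidstart /dev/md0                        # will fail, but logs superblock info
    grep -i events /var/log/messages | tail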

It's important that you replace "raid-disk" by "failed-disk" for that drive in
your raidtab. If you mkraid without that "failed-disk" change, the recovery
thread will kick in immediately and start rebuilding the parity blocks. If you
got something wrong, this will definitely kill your data. So, you mark one disk
as failed and create the array in degraded mode (the kernel won't try to
recover/resync the array then).
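
In the sample raidtab above, marking the third disk as the failed one
(assuming it is the one with the lowest event count) would look like this:

    device                  /dev/sdc1
    failed-disk             2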

With "failed-disk" you can specify exactly which disks you want to be active
and perhaps try different combinations for best results. BTW, only mount the
filesystem read-only while trying this out...
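
A minimal sketch of that trial loop (the mount point /mnt is an assumption;
adjust it to your setup):

    mkraid --force /dev/md0       # rewrite superblocks from /etc/raidtab
    mount -o ro /dev/md0 /mnt     # read-only, so nothing can be damaged
    ls -l /mnt                    # does the data look sane?
    umount /mnt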

If you have a spare-disk, you should mark that as "failed-disk", too.

* Check your raidtab against the info you get in the logs from the failed
startup (correct sequence of partitions).
* Mark one of the disks with the lowest event count as a "failed-disk"
instead of "raid-disk" in /etc/raidtab.
* Recreate the RAID superblocks using mkraid --force (the full command
sequence is sketched right after this list).
* Try to mount read-only and check whether all is OK.
* If it doesn't work, recheck your raidtab, perhaps mark a different drive as
failed, and go back to the mkraid step.
* Unmount, so you can fsck your RAID device (which you probably want to do).
* Add the last disk back using raidhotadd.
* Mount normally.
* Remove the failed-disk changes from your raidtab.
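
Put together, the whole sequence might look like this. The array, the
partition being re-added and the mount point (/dev/md0, /dev/sdc1, /mnt) are
assumptions; substitute your own:

    mkraid --force /dev/md0          # degraded start: one disk is failed-disk
    mount -o ro /dev/md0 /mnt        # inspect read-only first
    umount /mnt
    fsck /dev/md0                    # check the filesystem before going live
    raidhotadd /dev/md0 /dev/sdc1    # re-add the left-out disk; resync starts
    mount /dev/md0 /mnt              # mount normally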

========= your original version at
http://www.ostenfeld.dk/~jakob/Software-RAID.HOWTO/

6.1 Recovery from a multiple disk failure

The scenario is:

* A controller dies and takes two disks offline at the same time.
* All disks on one SCSI bus can no longer be reached if a disk dies.
* A cable comes loose...

In short: quite often you get a temporary failure of several disks at once;
afterwards the RAID superblocks are out of sync and you can no longer init
your RAID array.

One thing is left: rewrite the RAID superblocks with mkraid --force.

To get this to work, you'll need to have an up to date /etc/raidtab - if it
doesn't EXACTLY match devices and ordering of the original disks this won't
work.

Look at the syslog produced by trying to start the array; you'll see the event
count for each superblock. Usually it's best to leave out the disk with the
lowest event count, i.e. the oldest one.

If you mkraid without failed-disk, the recovery thread will kick in
immediately and start rebuilding the parity blocks - not necessarily what you
want at that moment.

With failed-disk you can specify exactly which disks you want to be active and
perhaps try different combinations for best results. BTW, only mount the
filesystem read-only while trying this out... This has been successfully used
by at least two guys I've been in contact with.

