Giulio Botto wrote:
> We have a similar problem on a 3-way RAID5 with three different event
> counters on the three discs.  We *KNOW* two of them are synced because
> the power went off within seconds on the last two discs (sdc1 and sdb1),
> and we also know that sda1 is not synced because the logs show it
> went down some 4 hours before the rest happened and the RAID went on
> in degraded mode.

THERE IS A WAY TO FIX THIS!

I've been in a similar situation.  Piete Brooks helped me see the light when
I was fumbling around trying to write a tool that would reconstruct
superblocks so they no longer said that two of my disks had failed.  My
problem was in some ways worse: I had replaced a disk that had gone bad, and
while the new disk was reconstructing, the array hit bad blocks on another
device and kicked that device out.  The result?  A 10-disk set with two
disks down, a big no-no with RAID 5.

I wanted to try to recover whatever I could off that newly failed disk.  I
suspected that the bad sectors were in a place without data, as we had not
encountered them before, and we were only roughly 1/3rd full.

The answer lies in mkraid.  I had always thought of mkraid as being just
like mke2fs.  If you run mke2fs on an existing ext2 file system, it will
erase the previous contents even if you make all the settings exactly the
same.  That's what it's supposed to do.

Mkraid is a little more like fdisk writing partition tables.  Your partition
table can be corrupted, but as long as you haven't written any data to the
drive, you may be able to recover.  If you make sure to re-make the exact same
partitions, with the exact same lengths starting at the exact same blocks,
you'll be able to access your disk again.  But it's one big if.

You can rebuild a RAID5 array with two or three disks down in the same way.
IF YOU ARE GOING TO DO THIS, YOU MUST BE VERY, VERY CAREFUL.  The method
I'll describe may have a _minute_ tolerance for error, but you can't rely
on that.  Just be very, very careful.

You'll need Martin Bene's failed disk patch for this to work correctly.  He
posted it to the list on March 23rd.  This one applies to 2.2.6 or 2.0.36.
There was an older one that I had applied to 2.2.3 as well.  I don't know how
well it would apply to other kernels.

You'll have to rebuild your kernel and raidtools (make sure to 'make
install' both) and get the failed-disk tag working before you proceed.
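
For reference, the mechanics look roughly like this.  The file names and
source paths below are placeholders for whatever you actually have, the
patch goes wherever Martin's instructions say (kernel tree, raidtools, or
both), and if your raidtools tree has a configure script, run that before
make:

  cd /usr/src/linux
  patch -p1 < /tmp/failed-disk.patch         # Martin's patch, saved from the list
  make dep bzImage modules modules_install   # rebuild, install and boot the new kernel
  cd /usr/src/raidtools
  make && make install                       # so mkraid understands failed-disk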

VERY IMPORTANT:  Have you ever swapped drives, swapped controllers, moved
drive IDs, or DONE ANYTHING that might change the drive letter that one of
your raid disks is on?  If you have, you must be VERY, VERY careful to
re-synch your /etc/raidtab file with the current physical disk locations.

Luckily, the kernel helps you out here.  When you boot (or when you type
'raidstart /dev/mdX', if you don't auto-start raid in the boot process by
using partition type 0xfd) you will see a printout of the exact raid disk
numbers that correspond to the physical disk names.  It happens when the
kernel is trying to start the array.  It looks like this:

May  6 18:31:44 music kernel: raid5: device hdd1 operational as raid disk 7
May  6 18:31:44 music kernel: raid5: device hdj1 operational as raid disk 8
[...]
May  6 18:31:44 music kernel: raid5: device hdg1 operational as raid disk 0
May  6 18:31:44 music kernel: raid5: not enough operational devices for md0 (2/10 failed)

(see /var/log/messages)

It will, unfortunately, not tell you the raid disk numbers of the failed
disks.  You either have to know which failed disk corresponds to which raid
disk number, or you have to use some tool that would print out the
superblock information for all disks, failed or not.  I'm not sure what
that tool would be.  You can probably use a little detective work/memory
and figure out which failed disk is which raid disk.

OK, so now you need to edit your /etc/raidtab and compare the physical
disk::raid disk mapping in /etc/raidtab against what the kernel spit out.
If your disks have moved around for whatever reason, you have to make sure
that you change /etc/raidtab to match what the kernel says about which raid
disk is which.  It's very, very important that you get this right.  If you
don't, your file system won't be there!
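
As an illustration only (using Giulio's three discs and made-up geometry,
not anything you should copy), the mapping lives in stanzas like these, and
the raid-disk numbers are what must agree with the kernel's printout:

  raiddev /dev/md0
      raid-level            5
      nr-raid-disks         3
      nr-spare-disks        0
      persistent-superblock 1
      chunk-size            32

      device                /dev/sdb1
      raid-disk             0
      device                /dev/sdc1
      raid-disk             1
      device                /dev/sda1
      raid-disk             2

The other parameters (chunk-size and friends) have to match whatever the
array was originally built with, for the same reason the fdisk trick only
works with identical partitions.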

NOTE: You aren't done yet with /etc/raidtab.  This is where failed-disk
comes in.  The way raid5 works, more than likely you have one disk that's
the most out of date.  In this poster's case, one disk failed hours before
two others went off-line.  When the first disk failed things kept going,
but when the two others failed the array got kicked off-line.  So the two
most recent failures are basically "good" (though they may be slightly out
of synch and cause minor file system corruption).  In my case, I was
rebuilding an array with a spare when another drive went off-line, so my
spare disk is "the most out of date" because it never got a full picture
of the raid array.

Edit /etc/raidtab, find the entry for the disk that is the most out of
date, and change the raid-disk flag on the following line to failed-disk.
This will keep the bad disk out of the array when you start it up again,
which is good, because you can't trust the data on this one disk (but
presumably you can basically trust the others).
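
For example, if /dev/sda1 were the stale disk in the raidtab sketched
above, its stanza would change from

      device                /dev/sda1
      raid-disk             2

to

      device                /dev/sda1
      failed-disk           2

The raid disk number stays the same; only the keyword changes.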

Now that you have your new /etc/raidtab file, check the raid-disk numbers
against the kernel output one more time.  Make sure they all match up.
You're at the now-or-never point in the process.

Run 'mkraid -f /dev/mdX'.  It will tell you that it's going to destroy all
your data, but if you've done this 100% right, it will actually save your
data.  Ironic, eh?  If you lose your nerve, hit ^C now.

Once that's done, check the raid status by typing 'cat /proc/mdstat'.  It
should tell you that you are running in degraded mode.  It shouldn't be
rebuilding.  It should show one disk down, though: the disk that you
entered as a failed-disk.
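
For a three-disk array with raid disk 2 marked failed, it should look
something like this (the device names, sizes and exact formatting here are
made up and will vary with your kernel/raid patch version).  The parts to
look for are the [3/2] count, the _ in the status map, and the absence of
any resync line:

  Personalities : [raid5]
  read_ahead 1024 sectors
  md0 : active raid5 sdc1[1] sdb1[0] 17920384 blocks level 5, 32k chunk, algorithm 2 [3/2] [UU_]
  unused devices: <none>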

Now mount your array read-only.  If everything has gone well, your
filesystem will be there.  If it hasn't, mount may complain that you have
to tell it the fs type, or there will be massive corruption in your
filesystem.  I hope things went well for you!
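
The read-only mount itself is just something like this, substituting your
own md device, filesystem type, and mount point:

  mount -t ext2 -o ro /dev/md0 /mnt/rescue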

If things didn't go right and you can't mount your filesystem, you have
another chance.  Under no circumstances should you raidhotadd your other
drive, or reboot (which will take the drive in as a spare, I believe).
Retrace your steps, figure out what went wrong, make double extra sure this
time that your /etc/raidtab maps correctly to your physical disks, and repeat
the above process.

If things did go well, congratulations!  You're an uber-raid-hacker now!
Prudent people would back up their data right now if they had anything that
they were worried about losing.  If you had a fresh backup, I don't think
you would have just gone through all this ;)

Then, once you're backed up (or have decided to wave your hands in the air
like you just don't care), raidhotadd the disk marked failed (provided, of
course, that it didn't truly fail... like Giulio's power failure problem).
It should begin the reconstruction, and you should be back on your way up!
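
raidhotadd takes the array first and then the partition.  Using Giulio's
devices as the example again:

  raidhotadd /dev/md0 /dev/sda1
  cat /proc/mdstat    # the reconstruction progress should show up here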

I hope this helps someone.  Thanks go out to Piete and Martin for providing
tools and advice for when I had to do this.  My users are in debt to you (and
so am I!)

Take care,

Tom
