I'm just documenting a problem and its solution in case anyone else
suffers the same thing - it took quite a bit of googling to find the
answer.  There's not a lot of info out there on raid recovery.

One thing about raid1 mirroring: if something goes wrong and a
partition (mirror) gets kicked out of the raid array, the system
continues working flawlessly.

To find out that this has happened, you have to look through the
output of dmesg (or notice the messages as they flash by during boot)
for entries like these:

    md: hde6's event counter: 00000174
    md: hda6's event counter: 000001ae
    md: superblock update time inconsistency -- using the most recent one
    md: freshest: hda6
    md: kicking non-fresh hde6 from array!
    md: unbind<hde6,1>
    md: export_rdev(hde6)
    md: RAID level 1 does not need chunksize! Continuing anyway.
    md0: max total readahead window set to 124k
    md0: 1 data-disks, max readahead per data-disk: 124k
    raid1: device hda6 operational as mirror 0
    raid1: md0, not all disks are operational -- trying to recover array
    raid1: raid set md0 active with 1 out of 2 mirrors
    md: updating md0 RAID superblock on device
    md: hda6 [events: 000001af]<6>(write) hda6's sb offset: 3076352
    md: recovery thread got woken up ...
    md0: no spare disk to reconstruct array! -- continuing in degraded mode
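
If you'd rather not scroll through the whole lot, grepping for the
tell-tale phrases works too (the patterns below just come from the
messages above):

    # look for md kicking disks out or running degraded
    dmesg | grep -Ei 'kicking non-fresh|degraded mode|not all disks'
    # /var/log/messages usually holds the same lines, going back further
    grep -Ei 'kicking non-fresh|degraded mode' /var/log/messages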

However, there is no other notification that an error has occurred.  I
recently discovered, by accident, that I'd been running my system with
all raid discs in degraded mode.  That is, I thought I was under the
nice safe raid1 umbrella but I wasn't, and hadn't been for a long time.
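
A crude way to get told about it would be a cron job that watches
/proc/mdstat for the '_' that marks a dead mirror - just a sketch,
assuming local mail delivery works and somebody reads root's mail:

    #!/bin/sh
    # warn if any md array is running with a missing mirror
    # (an "_" in the [UU]-style status column means a dead disk)
    if grep -q '\[U*_' /proc/mdstat; then
        cat /proc/mdstat | mail -s "RAID degraded on `hostname`" root
    fi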

There's a note somewhere that says never to access the partitions of a
raid array directly via the hd name like /dev/hda1 - if you must, use
the ataraid name: /dev/ataraid/d0p1 (disc 0, partition 1).  That's
probably what I'd done to cause the problem.

So, on an old Red Hat 7.2 system with pre-mdadm raid commands, how do
you fix it?  Probably you can fetch and install the mdadm package,
which everyone likes better than the old raidtools.  I was feeling
paranoid about doing that, and thought I'd have a go at getting back
into action with the old (raidtools-0.90) commands, which I'd used to
build the array in the first place.

The first problem was that I couldn't remember the command for
rebuilding a raid device.  A man -k raid listed the raid commands, but
none of them looked like they could rebuild the array.  Eventually I
found out why: there's no man entry in RH 7.2 or 7.3 for the required
command, raidhotadd.

Usage is, for example:

    /sbin/raidhotadd /dev/md0 /dev/hde7

It seemed to work just fine, and you can cat /proc/mdstat to watch the
progress of the rebuild.  That's all you'd need to do.

.... Unless you're a complete idiot like me, and you add the wrong
partition to the raid array.  Which is what I discovered when I went to
reconstruct the *other* raid array.

    # /sbin/raidhotadd /dev/md2 /dev/hde7
    /dev/md2: can not hot-add disk: invalid argument.
    
Not a catastrophe, because the partition I'd wrongly added was only the
one that had already been kicked out of the other array.  So sure, I'd
wiped that partition's data, but the data was still happily sitting on
the partition that was still in the array.
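
In hindsight, a quick comparison of /etc/raidtab (the raidtools config
file) against what's actually running would have avoided the whole
mess.  Something along these lines, before hot-adding anything:

    # which partitions *should* be in which array, according to raidtools
    grep -E 'raiddev|device' /etc/raidtab
    # which partitions actually are in each array right now
    cat /proc/mdstat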

I then added the correct partition, and so now had three partitions in
the one mirror.  (I didn't even know you could do that.)  A cat
/proc/mdstat now showed this:

    Personalities : [raid1] 
    read_ahead 1024 sectors
    md0 : active raid1 hde6[2] hde7[1] hda6[0]
          3076352 blocks [2/2] [UU]
          
    md2 : active raid1 hda7[0]
          29567488 blocks [2/1] [U_]
          
    unused devices: <none>
    

So the problem then became: how do you remove a partition (hde7) from
the raid device?  A little searching turned up raidhotremove.
Unfortunately, that won't work on a disk that's active in a running
array:

    # /sbin/raidhotremove /dev/md0 /dev/hde7
    /dev/md0: can not hot-remove disk: disk busy!

I noticed the raidhotgenerateerror command (usage:
raidhotgenerateerror /dev/md0 /dev/hde7), which appeared to work but
didn't actually let me hot-remove the disk afterwards.  (Still: disk
busy!)

Then a google search turned up this exact problem, and the solution
too: mark the partition faulty with the raidsetfaulty command.  I soon
discovered this isn't part of the raidtools-0.90-24 package, but it is
in 1.00.3.  Synaptic quickly showed there was no RH7.2 package
available via apt-get, and rpmfind quickly confirmed this.
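
An rpm query is the quick way to see which raidtools you've actually
got installed:

    # check the installed raidtools version
    rpm -q raidtools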

Google led to the name of the file, and with that a search on
raidtools-1.00.3.tar.gz actually led to a site holding not just the
.tgz but also a source rpm (http://linux.maruhn.com/sec/raidtools.html).
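
For reference, the rpm --rebuild / rpm -F dance mentioned below goes
roughly like this on RH 7.2 (a sketch - the release number in the
filenames and the build output directory may differ on your box):

    # rebuild binary rpms from the source rpm (old rpm 4.0.x, pre-rpmbuild)
    rpm --rebuild raidtools-1.00.3-*.src.rpm
    # freshen the installed raidtools with the newly built package
    rpm -Fvh /usr/src/redhat/RPMS/i386/raidtools-1.00.3-*.i386.rpm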

So I grabbed that, did the rpm --rebuild followed by the rpm -F, and
got the necessary commands.  Then it was a simple matter of:


1.  raidsetfaulty /dev/md0 /dev/hde7
2.  raidhotremove /dev/md0 /dev/hde7
3.  watch cat /proc/mdstat
    (until the reconstruction completes on /dev/md0)

    # cat /proc/mdstat
    Personalities : [raid1] 
    read_ahead 1024 sectors
    md0 : active raid1 hde7[2] hda6[0]
          3076352 blocks [2/1] [U_]
          [====>................]  recovery = 20.4% (629300/3076352) finish=1.2min speed=33121K/sec
    md2 : active raid1 hda7[0]
          29567488 blocks [2/1] [U_]
          
    unused devices: <none>
    
4.  raidhotadd /dev/md2 /dev/hde7
    (and wait until the raid reconstruction finishes)
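
Once both rebuilds finish, /proc/mdstat should show [UU] against both
md devices.  A one-liner to check nothing is still missing a mirror:

    # print any array still missing a mirror, otherwise report all clear
    grep '\[U*_' /proc/mdstat || echo "all mirrors healthy"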

There you are.  Hope this write-up is easier to find than the answer
was for me, if some other poor soul has the same problem.  Installing
mdadm and following similar command steps would probably also work, I
imagine.
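
For what it's worth, I believe the mdadm equivalents of steps 1, 2 and
4 would be something like this - untested on my part, so treat it as a
sketch:

    # mark the stray partition as failed, then pull it out of md0
    mdadm --manage /dev/md0 --fail /dev/hde7
    mdadm --manage /dev/md0 --remove /dev/hde7
    # add it to the array it actually belongs to
    mdadm --manage /dev/md2 --add /dev/hde7
    # and watch the rebuild
    watch cat /proc/mdstat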

luke

-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
