At 17:07 06.04.00, you wrote:
>In this foray into software raid I created two new devices. /dev/md0 was
>created to hold / and /dev/md1 is for /boot. I created both with the
>"failed-disk" option with one immediately following the other. Then I
>copied my whole system over to the new devices. I left it in failed mode,
>fixed /etc/fstab and /etc/lilo.conf, ran lilo (worked fine this time with
>a failed disk in the array), and rebooted.
The "mkraid in failed mode" bug and the "raid handling of remove/add/fail
disk" problem cancel each other out in this state.
mkraid in failed mode sets the number of disks one too high; the normal raid
code in degraded mode sets it one too low, so creating the array with
failed-disk and running it degraded gives the correct number of disks. Only
when you hotadd the original disk do you get a bad count: for a two-disk
raid-1, mkraid writes 3, degraded operation brings it back to 2, and the
hotadd pushes it up to 3 again.
>When the system came up on the
>raid devices properly I fixed up the partitions on the "failed" drive and
>did raidhotadd for both, it resync and I left it running. It was under
>this "step" that a day later I tried to run lilo (now with both drives in
>normal raid functioning) and it failed. So I guess perhaps that is the
>source of the problem unless I'm misunderstanding you? It didn't occur to
>me that anything would be different before and after raidhotadding
>(especially after messages on this list saying it worked fine).
It does work fine. The only place you'll ever see the error is with
external programs trying to work on the underlying physical devices of a
raid array; the only known program where the problem surfaces is lilo.
>So with the fix for A) I would have to end up backing up my data and
>recreating the raid-1 array anyway? I'm not opposed to doing this (in
>fact, I was prepared for it if the "failed-disk" method didn't work.)
I'd just save the contents of /boot somewhere, rebuild that device
(without the failed-disk stuff), then restore the copy; no patches are
needed for this.
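Something like this (a sketch; device names are assumed from your setup,
with /dev/md1 = /boot - adjust to your layout):
---------------------------------
cp -a /boot /tmp/boot-save        # save the contents of /boot
umount /boot
raidstop /dev/md1                 # stop the array
mkraid /dev/md1                   # recreate with a normal raidtab (no failed-disk)
mke2fs /dev/md1                   # fresh filesystem on the rebuilt array
mount /dev/md1 /boot
cp -a /tmp/boot-save/. /boot      # restore the copy
---------------------------------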
>If you think this will solve my problem could I get the patch(es) from you
>and try it?
You can have a look at the state of your md devices using mkraid /dev/md0
--debug. If you look at the logs, you'll see that in the raid superblock,
ND (Number of disks in the array) is 3 instead of 2 for your arrays. Lilo
tries to get the physical data for each of the disks and dies a horrible
death when it tries to access the (non-existing) 3rd disk.
You could use the patch from
ftp://ftp.sime.com/pub/linux/raidtools-19990824-0.90.mabene.gz to patch
your raidtools.
You could put your array(s) in the normal state (correct count of disks) by
redoing the raid creation stuff after installing the fixed mkraid
executable. Just mount the original /dev/hdaxx partitions again, stop the
raid arrays and recreate them in failed mode, copy, raidhotadd... same
procedure as before.
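In commands, roughly (again a sketch - substitute your real partition
names for /dev/hda1 etc.):
---------------------------------
mount /dev/hda1 /mnt/orig         # mount the original non-raid partition
raidstop /dev/md0                 # stop the array
mkraid /dev/md0                   # recreate in failed mode with the fixed mkraid
mke2fs /dev/md0
mount /dev/md0 /mnt/new
cp -a /mnt/orig/. /mnt/new        # copy the data over
raidhotadd /dev/md0 /dev/hda1     # re-add the original disk; resync starts
---------------------------------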
>So would the "last disk" be /dev/md3 because it's the last md device or
>/dev/md(0|1) since it has the last physical disk?
"Last disk" refers to the last physical disk in each raid array.
Final point: I'm not too sure about your lilo.conf file - I don't have any
of the
  bios=0x81
stuff in my config file; here's my config file for a working /boot on raid1
lilo setup:
---------------------------------
prompt
timeout = 50
vga = normal
boot=/dev/md4
# End LILO global section
# Linux bootable partition config begins
image = /boot/vmlinuz
root = /dev/md0
label = Linux
read-only
---------------------------------
Check that the partition you put under boot= is the one containing the
/boot filesystem, NOT your / filesystem. Works nicely for me; lilo cycles
over the physical disks and makes each of them bootable.
Bye, Martin
"you have moved your mouse, please reboot to make this change take effect"
--------------------------------------------------
Martin Bene vox: +43-316-813824
simon media fax: +43-316-813824-6
Andreas-Hofer-Platz 9 e-mail: [EMAIL PROTECTED]
8010 Graz, Austria
--------------------------------------------------
finger [EMAIL PROTECTED] for PGP public key