You have just managed to make me very, very nervous, since I, too, run
a RAID5 on FC4. (/me runs off to back up his RAID to DVD.)
A thought: are you using the old mdadm.conf from the original (FC4)
build? If so, it might be better to set it aside and let mdadm have a
shot at auto-detecting the degraded array on its own.
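Something like the following, assuming the array is /dev/md0 and the
members are /dev/sd[a-d]1 (placeholder names - adjust for your layout):

  # Move the stale config aside so it can't mislead mdadm
  mv /etc/mdadm.conf /etc/mdadm.conf.fc4
  # See what superblocks mdadm can find on its own
  mdadm --examine --scan
  # Then assemble the array by naming the members explicitly
  mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1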
Additionally, try upgrading your kernel and tools with yum - the bug
could well be fixed in a newer kernel, especially since Knoppix works.
I suspect "new kernel doesn't allow RAID5 recovery" would be treated
as a high-priority bug.
-DMZ
Ray Chen wrote:
Hello,
I've got a problem with the software raid I've been running for the
past few months, and was hoping you guys could help me out. Sorry for
the long e-mail, but there's a small bit of history that people will
probably ask me for anyway.
One disk of my raid5 went bad, and after replacing it with a new one,
I asked mdadm to rebuild the array. This was on an FC4 machine. During
the rebuild, the machine froze, and I was forced to pull the plug.
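(The replacement went roughly like this - /dev/sdd1 is a placeholder
for the failed/new disk's partition:)

  # Drop the dead member, then add the replacement;
  # mdadm kicks off the rebuild on its own
  mdadm /dev/md0 --fail /dev/sdd1
  mdadm /dev/md0 --remove /dev/sdd1
  mdadm /dev/md0 --add /dev/sdd1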
This left me with a dirty, degraded array. And even though I have a
dedicated system disk, my machine wouldn't boot: the kernel would
auto-detect the array and halt again before I could get a shell, even
with the raid=noautodetect kernel option.
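(That is, with the option appended to the kernel line in grub.conf -
the kernel version and root device below are just placeholders:)

  kernel /vmlinuz-2.6.x ro root=/dev/sda2 raid=noautodetect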
So, I booted with a Knoppix 5.0.1 CD. From there I was able to
rebuild the array and run fsck on the device with no problems. I
figured FC4 would then work normally. Nada.
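(Under Knoppix that was roughly the following - again, placeholder
device names:)

  # Force-start the degraded array, re-add the new disk,
  # and let the resync run to completion
  mdadm --assemble --run /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
  mdadm /dev/md0 --add /dev/sdd1
  # Check the filesystem once the rebuild finishes
  fsck -f /dev/md0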
Soon after mounting the array, my system would freeze again. This was
confusing, because Knoppix was able to use the device just fine. I
eventually decided to do a fresh install of FC5: I booted with the
Knoppix CD, mounted the raid, and backed up my system disk onto it.
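(Roughly, with placeholder mount points:)

  mount /dev/md0 /mnt/raid
  mount /dev/sda2 /mnt/system
  # Copy the system disk onto the array
  rsync -a /mnt/system/ /mnt/raid/sysdisk-backup/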
Installing FC5 was no help. Any process that accessed the mounted
array would become unresponsive. Eventually, I narrowed it down to the
following line in the output of /bin/ps:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1329 0.0 0.0 0 0 ? D< Jun19 0:00 [md0_raid5]
I assume that the high-priority uninterruptible sleep state (the "D<"
under STAT) is a bad thing. After scouring Google for some help, I
found this:
echo t > /proc/sysrq-trigger
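(sysrq has to be enabled for the trigger to work, and the resulting
task dump lands in the kernel ring buffer:)

  echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it isn't already
  dmesg | less                      # read back the task dump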
And, after investigating the output, this looked suspect:
md0_raid5 D E2E06800 3260 1329 11 1345 421 (L-TLB)
f7384ed4 00000011 00000003 e2e06800 003d1ec8 00000001 0000000a f7d60258
f7d60130 f7f480b0 c1a0b160 e2e06800 003d1ec8 00000000 00000000 f7e127c0
c01490ed 00000020 f7f480b0 00000000 c1a0bac0 00000002 f7384efc c01347a4
Call Trace:
[<c01490ed>] mempool_alloc+0x37/0xd3
[<c01347a4>] prepare_to_wait+0x12/0x4c
[<c0289349>] md_super_wait+0xa8/0xbd
[<c0134693>] autoremove_wake_function+0x0/0x2d
[<c0289ca5>] md_update_sb+0x107/0x159
[<c028c088>] md_check_recovery+0x161/0x3c3
[<f8a40b8c>] raid5d+0x10/0x113 [raid5]
[<c02f2267>] _spin_lock_irqsave+0x9/0xd
[<c028c99e>] md_thread+0xed/0x104
[<c0134693>] autoremove_wake_function+0x0/0x2d
[<c028c8b1>] md_thread+0x0/0x104
[<c01345b1>] kthread+0x9d/0xc9
[<c0134514>] kthread+0x0/0xc9
[<c0102005>] kernel_thread_helper+0x5/0xb
Every other process was stopped in _spin_unlock_irq,
schedule_timeout, or something else that made sense. But
mempool_alloc? Am I interpreting the output incorrectly? Does anybody
know what the two hex numbers after the function name represent?
Any help or insight would be appreciated,
Ray Chen
--
David Zakar
Mail/Data Analyst
Postmaster Team
AOL, LLC