Hello,
I've been having difficulty recently with my IDE raid5 array. Although it
has nothing to do with raid itself, any number of udma33 CRC errors will
result in the IDE bus being reset, often leaving neither drive on the bus
readable. While working on this problem (I've replaced the cables more than
once; it happens on different drives/channels, and only infrequently), I've
found myself dealing with a raid5 array that has lost more than one disk.
A reset and a new mkraid will fix that state, but a number of things that
happen during the failure make recovery more difficult.
Specifically, once the raid fails, any process trying to read from the array
will block forever instead of getting an error. This includes umount, making
it impossible to unmount the array. At least on my system (RH60 base), not
being able to unmount the mounted file system causes reboot to hang, and it
needs a hard reset at that point. The machine is at a colocation site, so
this is really-quite-bad(tm).
I loaded up a couple extra drives on my development system, and got to
playing. After some inspection, I made a patch that fixes the problem for
me. I changed:
* raid5_make_request: added a block to test for failed_disks > 1, clear
  needed flags in buffer_head and return early
* raid5_error: added a simple failed_disks > 1 check to keep the
  "md: bug in file raid5.c, line 659" message from appearing
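The two changes above can be sketched in userspace C. This is only a model of the idea, not the patch itself: the struct names, the flag names, the io_failed field and the return values are all made up for illustration, and which buffer_head flags the real patch clears is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel flags; not the real BH_* definitions. */
enum { FAKE_BH_UPTODATE = 1 << 0, FAKE_BH_DIRTY = 1 << 1 };

/* Hypothetical, heavily reduced model of a buffer_head. */
struct fake_buffer_head {
    unsigned int state;  /* bitmask of FAKE_BH_* flags */
    int io_failed;       /* 1 once the request was completed with an error */
};

/* Hypothetical, heavily reduced model of the raid5 per-array state. */
struct fake_raid5_conf {
    int raid_disks;
    int failed_disks;
};

/*
 * Model of the early return in the request path: with more than one
 * failed disk the data cannot be reconstructed, so fail the buffer
 * immediately instead of queueing it and letting the caller block.
 * Returns 0 if the request would be queued, -1 if it was failed early.
 */
int sketch_make_request(struct fake_raid5_conf *conf,
                        struct fake_buffer_head *bh)
{
    assert(conf != NULL && bh != NULL);

    if (conf->failed_disks > 1) {
        bh->state &= ~(unsigned int)(FAKE_BH_UPTODATE | FAKE_BH_DIRTY);
        bh->io_failed = 1;
        return -1;
    }

    /* The normal path would hand the buffer to the stripe code here. */
    return 0;
}

/*
 * Model of the check in the error path: once the array is already dead,
 * further error reports are ignored quietly rather than treated as a bug.
 * Returns 1 if a new failure was recorded, 0 if the report was ignored.
 */
int sketch_error(struct fake_raid5_conf *conf)
{
    assert(conf != NULL);

    if (conf->failed_disks > 1)
        return 0;

    conf->failed_disks++;
    return 1;
}
```

In this model, calling sketch_error() twice on a fresh array leaves failed_disks at 2, after which sketch_make_request() fails every buffer right away and a third sketch_error() call is a no-op.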
You can get the patch at
http://volition.org/~tsl/raid/raid5-clean-failure.patch.gz
It applies cleanly to linux-2.2.12 + raid0145-19990824-2.2.11, and it should
apply to most other revisions as well. The patch should be pretty safe, as
it only comes into play when more than one disk has failed in a raid5 set,
at which point you're currently pretty much hosed anyway.
If anyone feels up to it, give it a try and simulate some disk failures.
It helps a lot when trying to bring the raid system to a controlled shutdown.
Tom