RE: e2fsck not correcting RAID-5 recovered filesystem

Tom Livingston Wed, 8 Sep 1999 01:16:52 -0700
Ingo wrote:
> I will make the
> double-disk failure more graceful, it makes no sense to play hardball at
> that time anymore. We might be strict wrt. failures happening on a
> redundant array, but if it's in degraded mode we should either shut the
> array down immediately (as suggested before), or pass errors up, without
> modifying the superblocks.

Hey, cool!  This will be a great change in terms of usability.  I had a
great deal of heartache when I was building/upgrading a large 10-channel IDE
disk array.  I was having rampant CRC errors, and until I got them under
control, I frequently had to rebuild the superblocks so I could mount
degraded and reconstruct (again!) ;).

In my opinion, passing the errors on up would be preferable to shutting down
the array.  This matters especially when a parity rebuild makes us see an
error on another disk in the set, in a region that was empty or held a file
that was not commonly read.  If you tolerate the error on that second drive,
the rebuild can still happen as planned, albeit with corruption in the file
or area that contained the newly found error.  Otherwise the disk set will be
very hard to recover to a point where the newly discovered faulty drive can
be replaced.
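
To illustrate the argument above, here is a minimal Python sketch of a
rebuild that passes read errors up per stripe instead of aborting.  It is
not the md driver's code: rebuild_disk, read_error_at and the in-memory
block lists are all hypothetical names made up for the example.  The only
RAID-5 fact it relies on is that a missing block is the XOR of the other
blocks in its stripe, data and parity alike.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_disk(surviving_disks, read_error_at):
    """Rebuild the missing disk stripe by stripe.

    surviving_disks: one list of blocks per surviving disk (data or parity).
    read_error_at:   set of (disk_index, stripe_index) pairs that fail to
                     read; a hypothetical stand-in for a real I/O error.
    Returns (rebuilt_blocks, bad_stripes); a stripe that cannot be
    reconstructed is stored as None and reported, not treated as fatal.
    """
    rebuilt, bad_stripes = [], []
    for stripe in range(len(surviving_disks[0])):
        failed = any(
            (disk, stripe) in read_error_at
            for disk in range(len(surviving_disks))
        )
        if failed:
            # Second failure inside this stripe: the data here is lost,
            # but every other stripe can still be reconstructed.
            rebuilt.append(None)
            bad_stripes.append(stripe)
            continue
        rebuilt.append(xor_blocks([disk[stripe] for disk in surviving_disks]))
    return rebuilt, bad_stripes
```

With one unreadable block on a surviving disk, only that stripe comes back
as lost; the rest of the replacement disk is rebuilt normally, which is the
behaviour argued for above.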

On the subject of usability, one thing that can lead to or exacerbate this
situation is the fact that almost any raid5 array that gets shut down
incorrectly (due to a software or hardware crash) inevitably comes back up
in degraded mode, from which it can rebuild.

This doesn't seem to depend on whether the array was writing at the time; if
I'm wrong, please correct me.  But my general impression is that the array
can be idle (or essentially idle) for hours on end, and a bad shutdown (like
a kernel crash) will still cause one of the disks to be identified as out of
date on the reboot.

I realize there may be performance implications here, where writes are being
delayed.  But on an essentially idle system, when the disks would otherwise
be sitting idle, shouldn't the raid system synchronize the disks as quickly
as possible?  Whatever the cause of the condition, it would sure be nice if
the system didn't end up rebuilding every time you crash.
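
The kind of policy I am asking for could look roughly like the Python
sketch below.  This is purely an illustration of the idea, not a
description of what the raid code does today: ArrayState, idle_seconds and
needs_resync_after_crash are hypothetical names, and the sketch assumes
that all pending writes have reached every disk by the time the idle
timeout expires.

```python
class ArrayState:
    """Toy model of a 'mark the array clean once it goes idle' policy."""

    def __init__(self, idle_seconds=5.0):
        self.idle_seconds = idle_seconds  # assumed quiet period before marking clean
        self.dirty = False                # True while disks may disagree
        self.last_write = None            # timestamp of the most recent write

    def write(self, now):
        """Record a write; the array counts as dirty until it goes idle."""
        self.dirty = True
        self.last_write = now

    def tick(self, now):
        """Periodic check: mark the array clean after a quiet period."""
        if self.dirty and now - self.last_write >= self.idle_seconds:
            self.dirty = False

    def needs_resync_after_crash(self):
        """A crash only forces a rebuild if the array was still dirty."""
        return self.dirty
```

Under this model a crash a few seconds after the last write still forces a
resync, but a crash after hours of idling does not.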

Tom