Hi,

This is a FAQ: I've answered it several times, but in different places,
so here's a definitive answer which will be my last one: future
questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner <[EMAIL PROTECTED]>
said:

>> then raid can miscalculate parity by assuming that the buffer matches
>> what is on disk, and that can actually cause damage to other data
>> than the data being written if a disk dies and we have to start using
>> parity for that stripe.

> do you know if using soft RAID5 + regular ext2 causes the same sort of
> damage, or if the corruption chances are lower when using a
> non-journaled FS?

Sort of.  See below.

> is the potential corruption caused by the RAID layer or by the FS
> layer?  (does the FS code or the RAID code need to be fixed?)

It is caused by neither: it is an interaction effect.

> if it's caused by the FS layer, how do XFS (not here yet ;-) ) or
> ReiserFS behave in this case?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering.  There is no policy to guide what gets
written and when: writeback caching can trickle dirty buffers out to
disk at any time, and other system components such as filesystems and
the VM can force a writeback of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in
the buffer cache *MUST NOT* be written to disk unless the filesystem
explicitly says so.
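
As a rough picture of what "explicitly says so" means, here is a
purely illustrative sketch (the stage names and the helper below are
made up, not ext3's real state machine): the home-location copy of a
block must stay off the disk until its transaction's commit record is
safely in the journal.

    #include <stdio.h>
    #include <stdbool.h>

    /* Illustrative stages of a journaled metadata update. */
    enum stage { IN_MEMORY, JOURNALED, COMMITTED, CHECKPOINTED };

    struct toy_block {
        enum stage stage;
        bool       b_dirty;   /* may well be set long before COMMITTED */
    };

    /* The constraint the filesystem needs the lower layers to honour:
     * the home-location copy may only be written once the commit
     * record for its transaction is safely on disk. */
    static bool may_write_home_location(const struct toy_block *b)
    {
        return b->stage >= COMMITTED;
    }

    int main(void)
    {
        struct toy_block b = { JOURNALED, true };

        printf("home-location write allowed: %s\n",
               may_write_home_location(&b) ? "yes" : "no");
        return 0;
    }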

RAID-5 needs to interact directly with the buffer cache in order to
get good performance.

There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data
   in a stripe gets written to disk at once.  For RAID-5, this is very
   much faster than dribbling the stripe back one disk at a time.
   Unfortunately, this can result in dirty buffers being written to disk
   earlier than the filesystem expected, with the result that on a
   crash, the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit set; both
   this and interaction 2) are illustrated in the toy sketch below.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
   order to calculate parity without reading all of the disks in a
   stripe.  If a journaling system tries to prevent modified data from
   being flushed to disk by deferring the setting of the buffer dirty
   flag, then RAID-5 will think that the buffer, being clean, matches
   the state of the disk and so it will calculate parity which doesn't
   actually match what is on disk.  If we crash and one disk fails on
   reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit clear.
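
To make both failure modes concrete, here is a toy user-space sketch.
The struct and the function below are invented purely for illustration
(this is not the real buffer cache or raid5 code, just the logic
described in 1) and 2)): one buffer is dirty but uncommitted, as in
the ext3 case, and another has been modified with its dirty bit
deferred, as in the reiserfs case.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NDATA 3                  /* data disks in the toy stripe */

    /* Toy stand-in for a buffer_head; only the fields that matter
     * here, with illustrative names rather than the real layout. */
    struct toy_buffer {
        uint8_t data;                /* one byte stands in for a block   */
        bool    b_dirty;             /* cache is newer than the disk     */
        bool    committed;           /* journal has committed the update */
    };

    static uint8_t disk[NDATA] = { 0x11, 0x22, 0x33 }; /* on the platters */
    static uint8_t disk_parity;

    /* The behaviour described in 1) and 2), in miniature: write out
     * every buffer marked dirty, and compute parity from the cached
     * contents rather than re-reading the disks. */
    static void raid5_flush_stripe(struct toy_buffer bh[NDATA])
    {
        uint8_t parity = 0;

        for (int d = 0; d < NDATA; d++)
            parity ^= bh[d].data;            /* trusts the cache blindly */
        disk_parity = parity;

        for (int d = 0; d < NDATA; d++) {
            if (bh[d].b_dirty) {
                if (!bh[d].committed)
                    printf("block %d hit disk before its commit record "
                           "(interaction 1)\n", d);
                disk[d] = bh[d].data;
                bh[d].b_dirty = false;
            }
        }
    }

    int main(void)
    {
        struct toy_buffer bh[NDATA] = {
            { 0x11, false, true  },  /* untouched: cache matches disk    */
            { 0x55, true,  false },  /* ext3 style: dirty, uncommitted   */
            { 0x99, false, false },  /* reiserfs style: modified, dirty
                                        bit deferred                     */
        };

        raid5_flush_stripe(bh);

        /* Crash here: block 2 never reached disk, but parity assuming
         * its new contents did.  Disk 0 then fails on reboot, and we
         * rebuild it from the survivors plus that parity (interaction 2). */
        uint8_t rebuilt = disk_parity ^ disk[1] ^ disk[2];
        printf("block 0 really held 0x%02x, reconstruction gives 0x%02x\n",
               disk[0], rebuilt);
        return 0;
    }

Block 0 was never even being written, yet it is the one that comes
back as garbage if its disk dies after the crash: exactly the damage
to unrelated data mentioned in the quote at the top.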

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have an incremented b_count reference count, and
making sure that the filesystems all hold that count raised when the
buffers are in an inconsistent or pinned state.
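
In the same toy terms as the sketch above, the proposed rule is just
one extra test on the RAID-5 side plus a matching reference-count
discipline in the filesystems (the names are again invented, not the
actual patch):

    #include <stdio.h>
    #include <stdbool.h>

    /* Only the fields the rule cares about; names are illustrative. */
    struct toy_buffer {
        int  b_count;    /* raised by the fs while the buffer is pinned
                            in an uncommitted or inconsistent state     */
        bool b_dirty;
    };

    /* RAID-5 side of the rule: a buffer with a raised reference count
     * is neither bunch-written early (interaction 1) nor used as a
     * parity source (interaction 2); it is simply left alone. */
    static bool raid5_may_touch(const struct toy_buffer *bh)
    {
        return bh->b_count == 0;
    }

    int main(void)
    {
        struct toy_buffer pinned = { 1, true };   /* held by the journal */
        struct toy_buffer normal = { 0, true };   /* ordinary dirty data */

        printf("pinned buffer: %s\n",
               raid5_may_touch(&pinned) ? "touch" : "skip");
        printf("normal buffer: %s\n",
               raid5_may_touch(&normal) ? "touch" : "skip");
        return 0;
    }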

3) The soft-raid background rebuild code reads and writes through the
   buffer cache with no synchronisation at all with other fs activity.
   After a crash, this background rebuild code will kill the
   write-ordering attempts of any journaling filesystem.

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's
still not that hard to do.


So, can any of these problems affect other, non-journaled filesystems
too?  Yes, 2) can: throughout the kernel there are places where buffers
are modified before the dirty bits are set.  In such places we will
always mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existent on
non-SMP machines), and the window in which it will persist on disk is
also very small.

This is not a problem.  It is just another example of a race window
which exists already with _all_ non-battery-backed RAID-5 systems (both
software and hardware): even with perfect parity calculations, it is
simply impossible to guarantee that an entire stripe update on RAID-5
completes in a single, atomic operation.  If you write a single data
block and its parity block to the RAID array, then on an unexpected
reboot you will always have some risk that the parity will have been
written, but not the data.  On a reboot, if you lose a disk then you can
reconstruct it incorrectly due to the bogus parity.
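
Spelled out with one byte standing in for each block (the values are
picked arbitrarily), that window looks like this:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Two data disks plus parity; one byte stands in for a block. */
        uint8_t d0 = 0x11, d1 = 0x22;
        uint8_t p  = d0 ^ d1;            /* parity in sync: 0x33 */

        /* Rewrite d0: the data write and the parity write are separate
         * I/Os, and an unexpected reboot can land between them. */
        uint8_t new_d0 = 0x44;
        p = new_d0 ^ d1;                 /* the parity write completes... */
        /* ...but the crash hits before new_d0 reaches the data disk,
         * so d0 on the platter still holds 0x11. */

        /* Reboot with the d1 disk dead: reconstruct it from d0 and p. */
        uint8_t rebuilt_d1 = d0 ^ p;
        printf("d1 really held 0x%02x, reconstruction gives 0x%02x\n",
               d1, rebuilt_d1);
        return 0;
    }

d1 was not part of the write at all; it only comes back wrong because
the crash and the disk failure combine, which is exactly the double
failure described below.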

THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
only way you can get bitten by this failure mode is to have a system
failure and a disk failure at the same time.


--Stephen
