Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
On Fri, 14 Jan 2000, D. Lance Robinson wrote:

> Ingo,
>
> I can fairly regularly generate corruption (data or ext2 filesystem) on a
> busy RAID-5 by adding a spare drive to a degraded array and letting it
> build the parity. Could the problem be from the bad (illegal) buffer
> interactions you mentioned, or are there other areas that need fixing as
> well? I have been looking into this issue for a long time with no
> resolution. Since you may be aware of possible problem areas: any ideas,
> code or encouragement is greatly welcome.

Hi Lance,

Which RAID code were you using when you hit the above problem? You were using the raid 0.90 patches for 2.2.x, right?

Andrea
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

Chris Wedgwood writes:

> > This may affect data which was not being written at the time of the
> > crash. Only raid 5 is affected.
>
> Long term -- if you journal to something outside the RAID5 array (ie.
> to raid-1 protected log disks) then you should be safe against this
> type of failure?

Indeed. The jfs journaling layer in ext3 is a completely generic block device journaling layer which could be used for such a purpose (and raid/LVM journaling is one of the reasons it was designed this way).

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

Benno Senoner writes:

> wow, really good idea to journal to a RAID1 array !
>
> do you think it is possible to do the following:
>
> - N disks holding a soft RAID5 array.
> - reserve a small partition on at least 2 disks of the array to hold a
>   RAID1 array.
> - keep the journal on this partition.

Yes. My jfs code will eventually support this. The main thing it is missing right now is the ability to journal multiple devices to a single journal: the on-disk structure is already designed with that in mind, but the code does not yet support it.

--Stephen
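For concreteness, the layout Benno proposes could be expressed to the raid 0.90 tools roughly as below. This is only a sketch: the device names, partition sizes and md numbering are illustrative assumptions, not details taken from the thread.

    # /etc/raidtab (sketch)
    # md0: small RAID-1 across spare partitions, to hold the journal
    raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        persistent-superblock   1
        chunk-size              4
        device                  /dev/sda1
        raid-disk               0
        device                  /dev/sdb1
        raid-disk               1

    # md1: the main RAID-5 array across the large partitions
    raiddev /dev/md1
        raid-level              5
        nr-raid-disks           3
        persistent-superblock   1
        chunk-size              32
        device                  /dev/sda2
        raid-disk               0
        device                  /dev/sdb2
        raid-disk               1
        device                  /dev/sdc2
        raid-disk               2

The filesystem would then live on /dev/md1 with its journal on /dev/md0, once the external-journal support described above exists.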
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Ingo,

I can fairly regularly generate corruption (data or ext2 filesystem) on a busy RAID-5 by adding a spare drive to a degraded array and letting it build the parity. Could the problem be from the bad (illegal) buffer interactions you mentioned, or are there other areas that need fixing as well? I have been looking into this issue for a long time with no resolution. Since you may be aware of possible problem areas: any ideas, code or encouragement is greatly welcome.

<>< Lance.

Ingo Molnar wrote:

> On Wed, 12 Jan 2000, Gadi Oxman wrote:
>
> > As far as I know, we took care not to poke into the buffer cache to
> > find clean buffers -- in raid5.c, the only code which does a
> > find_buffer() is:
>
> yep, this is still the case. (Sorry Stephen, my bad.) We will have these
> problems once we try to eliminate the current copying overhead.
> Nevertheless there are bad (illegal) interactions between the RAID code
> and the buffer cache; I'm cleaning this up for 2.3 right now. Especially
> the reconstruction code is a rathole. Unfortunately, blocking
> reconstruction until b_count == 0 is not acceptable because several
> filesystems (such as ext2fs) keep metadata caches around (eg. the block
> group descriptors in the ext2fs case) which have b_count == 1 for a
> longer time.
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Chris Wedgwood wrote:

> > In the power+disk failure case, there is a very narrow window in which
> > parity may be incorrect, so loss of the disk may result in inability to
> > correctly restore the lost data.
>
> For some people, this very narrow window may still be a problem.
> Especially when you consider the case of a disk failing because of a
> power surge -- which also kills a drive.
>
> > This may affect data which was not being written at the time of the
> > crash. Only raid 5 is affected.
>
> Long term -- if you journal to something outside the RAID5 array (ie.
> to raid-1 protected log disks) then you should be safe against this
> type of failure?
>
> -cw

Wow, really good idea to journal to a RAID1 array!

Do you think it is possible to do the following:

- N disks holding a soft RAID5 array.
- reserve a small partition on at least 2 disks of the array to hold a RAID1 array.
- keep the journal on this partition.

Do you think that this will be possible? Is ext3 / reiserfs capable of keeping the journal on a different partition than the one holding the FS?

That would really be great!

Benno.
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Wed, 12 Jan 2000 22:09:35 +0100, Benno Senoner <[EMAIL PROTECTED]> said:

> Sorry for my ignorance, I got a little confused by this post:
> Ingo said we are 100% journal-safe, you said the contrary.

Raid resync is safe in the presence of journaling. Journaling is not safe in the presence of raid resync.

> can you or Ingo please explain to us in which situation (power loss)
> running linux-raid + journaled FS we risk a corrupted filesystem ?

Please read my previous reply on the subject (the one that started off with "I'm tired of answering the same question a million times, so here's a definitive answer"). Basically, there will always be a small risk of data loss if power-down is accompanied by loss of a disk (it's a double failure); and the current implementation of raid resync means that journaling will be broken by the raid1 or raid5 resync code after a reboot on a journaled filesystem (ext3 is likely to panic; reiserfs will not, but will still get its IO ordering requirements messed up by the resync).

> After the reboot, if all disks remain intact physically, will we only
> lose the data that was being written, or is there a possibility to end
> up in a corrupted filesystem which could cause more damage in future ?

In the power+disk failure case, there is a very narrow window in which parity may be incorrect, so loss of the disk may result in inability to correctly restore the lost data. This may affect data which was not being written at the time of the crash. Only raid 5 is affected.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Wed, 12 Jan 2000 07:21:17 -0500 (EST), Ingo Molnar <[EMAIL PROTECTED]> said:

> On Wed, 12 Jan 2000, Gadi Oxman wrote:
>
> > As far as I know, we took care not to poke into the buffer cache to
> > find clean buffers -- in raid5.c, the only code which does a
> > find_buffer() is:
>
> yep, this is still the case.

OK, that's good to know.

> Especially the reconstruction code is a rathole. Unfortunately,
> blocking reconstruction until b_count == 0 is not acceptable because
> several filesystems (such as ext2fs) keep metadata caches around
> (eg. the block group descriptors in the ext2fs case) which have
> b_count == 1 for a longer time.

That's not a problem: we don't need reconstruction to interact with the buffer cache at all. Ideally, what I'd like to see the reconstruction code do is to:

 * lock a stripe
 * read a new copy of that stripe locally
 * recalc parity and write back whatever disks are necessary for the stripe
 * unlock the stripe

so that the data never goes through the buffer cache at all, but the stripe is locked with respect to other IOs going on below the level of ll_rw_block (remember there may be IOs coming in to ll_rw_block which are not from the buffer cache, eg. swap or journal IOs).

> We are '100% journal-safe' if power fails during resync.

Except for the fact that resync isn't remotely journal-safe in the first place, yes. :-)

--Stephen
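A minimal user-space sketch of that locked-stripe scheme might look like the following. It is illustrative only: the geometry, the per-stripe mutexes and the helper names are invented for the example, not the actual raid5.c code.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define NDISK   3        /* two data disks + one parity disk */
    #define NSTRIPE 4
    #define BSIZE   16

    static unsigned char disk[NDISK][NSTRIPE][BSIZE];   /* fake platters */
    static pthread_mutex_t stripe_lock[NSTRIPE];

    static void resync_stripe(int s)
    {
        unsigned char buf[NDISK][BSIZE];

        pthread_mutex_lock(&stripe_lock[s]);        /* 1. lock the stripe */

        for (int d = 0; d < NDISK; d++)             /* 2. private read -- */
            memcpy(buf[d], disk[d][s], BSIZE);      /*    no buffer cache */

        for (int i = 0; i < BSIZE; i++)             /* 3. recalc parity   */
            buf[2][i] = buf[0][i] ^ buf[1][i];
        memcpy(disk[2][s], buf[2], BSIZE);          /*    write it back   */

        pthread_mutex_unlock(&stripe_lock[s]);      /* 4. unlock          */
    }

    int main(void)
    {
        for (int s = 0; s < NSTRIPE; s++)
            pthread_mutex_init(&stripe_lock[s], NULL);
        for (int s = 0; s < NSTRIPE; s++)
            resync_stripe(s);
        printf("resynced %d stripes\n", NSTRIPE);
        return 0;
    }

The point of the per-stripe lock is that ordinary writes would take the same lock, so resync and in-flight IO below ll_rw_block serialize per stripe without either of them ever consulting the buffer cache.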
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 16:41:55 -0600, "Mark Ferrell" <[EMAIL PROTECTED]> said:

> Perhaps I am confused. How is it that a power outage while attached
> to the UPS becomes "unpredictable"?

One of the most common ways to get an outage while on a UPS is somebody tripping over, or otherwise removing, the cable between the UPS and the computer. How exactly is that predictable?

Just because you reduce the risk of unexpected power outage doesn't mean we can ignore the possibility.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: > Ideally, what I'd like to see the reconstruction code do is to: > > * lock a stripe > * read a new copy of that stripe locally > * recalc parity and write back whatever disks are necessary for the stripe > * unlock the stripe > > so that the data never goes through the buffer cache at all, but that > the stripe is locked with respect to other IOs going on below the level > of ll_rw_block (remember there may be IOs coming in to ll_rw_block which > are not from the buffer cache, eg. swap or journal IOs). > > > We are '100% journal-safe' if power fails during resync. > > Except for the fact that resync isn't remotely journal-safe in the first > place, yes. :-) > > --Stephen Sorry for my ignorance I got a little confused by this post: Ingo said we are 100% journal-safe, you said the contrary, can you or Ingo please explain us in which situation (power-loss) running linux-raid+ journaled FS we risk a corrupted filesystem ? I am interested what happens if the power goes down while you write heavily to a ext3/reiserfs (journaled FS) on soft-raid5 array. After the reboot if all disk remain intact physically, will we only lose the data that was being written, or is there a possibility to end up in a corrupted filesystem which could more damages in future ? (or do we need to wait for the raid code in 2.3 ?) sorry for re-asking that question, but I am still confused. regards, Benno.
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
On Wed, 12 Jan 2000, Gadi Oxman wrote:

> As far as I know, we took care not to poke into the buffer cache to
> find clean buffers -- in raid5.c, the only code which does a
> find_buffer() is:

yep, this is still the case. (Sorry Stephen, my bad.) We will have these problems once we try to eliminate the current copying overhead. Nevertheless there are bad (illegal) interactions between the RAID code and the buffer cache; I'm cleaning this up for 2.3 right now. Especially the reconstruction code is a rathole. Unfortunately, blocking reconstruction until b_count == 0 is not acceptable because several filesystems (such as ext2fs) keep metadata caches around (eg. the block group descriptors in the ext2fs case) which have b_count == 1 for a longer time.

If both power and a disk fail at once then we still might get local corruption for partially written RAID5 stripes. If either power or a disk fails, then the Linux RAID5 code is safe wrt. journalling, because it behaves like an ordinary disk. We are '100% journal-safe' if power fails during resync. We are also 100% journal-safe if power fails during reconstruction of a failed disk or in degraded mode.

The 2.3 buffer-cache enhancements I wrote ensure that 'cache snooping' and adding to the buffer-cache can be done safely by 'external' cache managers. I also added means to do atomic IO operations which in fact are several underlying IO operations -- without the need of allocating a separate bh. The RAID code uses these facilities now.

Ingo
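One common way to realize the kind of compound IO Ingo mentions -- one logical operation fanned out into several device IOs, completing only when the last piece lands -- is a shared completion count. The sketch below is a user-space illustration of that general pattern only; it is not Ingo's 2.3 code, and all names are invented.

    #include <stdio.h>

    /* One logical IO fanned out into several device-level IOs; the whole
     * thing completes exactly once, when the last piece finishes. */
    struct compound_io {
        int remaining;                         /* outstanding sub-IOs */
        void (*done)(struct compound_io *);
    };

    static void stripe_write_done(struct compound_io *io)
    {
        printf("whole stripe write complete\n");
    }

    /* Completion handler run at the end of each per-disk IO. */
    static void sub_io_end(struct compound_io *io)
    {
        if (--io->remaining == 0)              /* atomic_dec_and_test() */
            io->done(io);                      /* in a real kernel      */
    }

    int main(void)
    {
        struct compound_io io = { 3, stripe_write_done };

        for (int d = 0; d < 3; d++) {          /* pretend 3 disk writes */
            printf("disk %d write done\n", d); /* finish one by one     */
            sub_io_end(&io);
        }
        return 0;
    }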
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
James Manning wrote:

> [ Tuesday, January 11, 2000 ] Benno Senoner wrote:
> > The problem is that power outages are unpredictable even in presence
> > of UPSes, therefore it is important to have some protection against
> > power losses.
>
> I gotta ask: dying power supply? cord getting ripped out?
> Most ppl run serial lines (of course :) and with powerd they
> get nice shutdowns :)
>
> Just wanna make sure I'm understanding you...
>
> James
> --
> Miscellaneous Engineer --- IBM Netfinity Performance Development

Yep, obviously the UPS has a serial line to shut down the machine nicely before a failure, but it happened to me that the serial cable was disconnected and the power outage lasted SEVERAL hours during a weekend, when no one was in the machine room (of an ISP). You know Murphy's law... :-)

But I am mainly interested in power-failure protection in the case where you want to set up a workstation with a reliable disk array (soft raid5) and do not always have a UPS handy. You will lose the file that was being written, but the important thing is that the disk array remains in a safe state, just like a single disk + journaled FS.

Stephen Tweedie said that this is possible (by fixing the remaining races in the RAID code). If these problems get fixed sometime, then our fears of a corrupted soft-RAID array in the case of a power failure on a machine without a UPS will completely go away.

cheers,
Benno.
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Perhaps I am confused. How is it that a power outage while attached to the UPS becomes "unpredictable"?

We run a Dell PowerEdge 2300/400 using Linux software raid, and the system monitors its own UPS. When a power failure occurs, the system will bring itself down to a minimal state (runlevel 1) after the batteries are below 50% .. and once below 15% it will shut down, which turns off the UPS. When power comes back on, the UPS fires up and the system resumes as normal.

Admittedly this won't prevent issues like god reaching out and slapping my system via lightning or something, nor will it resolve issues where someone decides to grab the power cable and swing around on it, severing the connection from the UPS to the system .. but for the most part it has thus far proven to be a fairly decent configuration.

Benno Senoner wrote:

> "Stephen C. Tweedie" wrote:
> (...)
>
> > 3) The soft-raid background rebuild code reads and writes through the
> >    buffer cache with no synchronisation at all with other fs activity.
> >    After a crash, this background rebuild code will kill the
> >    write-ordering attempts of any journalling filesystem.
> >
> >    This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.
> >
> > Interaction 3) needs a bit more work from the raid core to fix, but it's
> > still not that hard to do.
> >
> > So, can any of these problems affect other, non-journaled filesystems
> > too? Yes, 1) can: throughout the kernel there are places where buffers
> > are modified before the dirty bits are set. In such places we will
> > always mark the buffers dirty soon, so the window in which an incorrect
> > parity can be calculated is _very_ narrow (almost non-existent on
> > non-SMP machines), and the window in which it will persist on disk is
> > also very small.
> >
> > This is not a problem. It is just another example of a race window
> > which exists already with _all_ non-battery-backed RAID-5 systems (both
> > software and hardware): even with perfect parity calculations, it is
> > simply impossible to guarantee that an entire stripe update on RAID-5
> > completes in a single, atomic operation. If you write a single data
> > block and its parity block to the RAID array, then on an unexpected
> > reboot you will always have some risk that the parity will have been
> > written, but not the data. On a reboot, if you lose a disk then you can
> > reconstruct it incorrectly due to the bogus parity.
> >
> > THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the
> > only way you can get bitten by this failure mode is to have a system
> > failure and a disk failure at the same time.
> >
> > --Stephen
>
> thank you very much for these clear explanations,
>
> Last doubt: :-)
> Assume all RAID code - FS interaction problems get fixed; since a
> linux soft-RAID5 box has no battery backup, does this mean that we
> will lose data ONLY if there is a power failure AND a successive disk
> failure? If we lose the power, and after reboot all disks remain
> intact, can the RAID layer reconstruct all information in a safe way?
>
> The problem is that power outages are unpredictable even in presence
> of UPSes, therefore it is important to have some protection against
> power losses.
>
> regards,
> Benno.
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
----- Original Message -----
From: "Benno Senoner" <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Cc: "Linux RAID" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; "Ingo Molnar" <[EMAIL PROTECTED]>
Sent: Tuesday, January 11, 2000 1:17 PM
Subject: Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

-- much snippage here

> The problem is that power outages are unpredictable even in presence
> of UPSes, therefore it is important to have some protection against
> power losses.
>
> regards,
> Benno.

I run an MGE UPS on my RH6.1 box running RAID 1. They have software for Linux that communicates with the UPS and performs an orderly system shutdown if the box goes on battery and stays on battery for a given (user-selectable) length of time. I have tested and verified that this actually works; it's a Good Thing(tm).

I did have to cut one pin on the standard RS-232 cable that came with the UPS for use on the Linux box, and download the software and install it (scripted, easy...).

bwilling
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Wed, 12 Jan 2000 00:12:55 +0200 (IST), Gadi Oxman <[EMAIL PROTECTED]> said:

> Stephen, I'm afraid that there are some misconceptions about the
> RAID-5 code.

I don't think so --- I've been through this with Ingo --- but I appreciate your feedback, since I'm getting inconsistent advice here! Please let me explain...

> In an early pre-release version of the RAID code (more than two years
> ago?), which didn't protect against that race, we indeed saw locked
> buffers changing under us from the point in which we computed the
> parity till the point in which they were actually written to the disk,
> leading to corrupted parity.

That is not the race. The race has nothing at all to do with buffers changing while they are being used for parity: that's a different problem, long ago fixed by copying the buffers.

The race I'm concerned about can occur when the raid driver wants to compute parity for a stripe and finds some of the blocks present, and clean, in the buffer cache. Raid assumes that those buffers represent what is on disk, naturally enough, so it uses them to calculate parity without rereading all of the disk blocks in the stripe.

The trouble is that the standard practice in the kernel, when modifying a buffer, is to make the change and _then_ mark the buffer dirty. If you hit that window, then the raid driver will find a buffer which doesn't match what is on disk, and will compute parity from that buffer rather than from the on-disk contents.

> 1. n dirty blocks are scheduled for a stripe write.

That's not the race. The problem occurs when only one single dirty block is scheduled for a write, and we need to find the contents of the rest of the stripe to compute parity.

> Point (2) is also incorrect; we have taken care *not* to peek into
> the buffer cache to find clean buffers and use them for parity
> calculations. We make no such assumptions.

Not according to Ingo --- can we get a definitive answer on this, please?

Many thanks,
Stephen
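To make the window concrete, here is a small self-contained C model of the sequence Stephen describes. It is a user-space illustration only: the one-byte "blocks", the dirty flag and all the names are invented for the example, not kernel code.

    #include <stdio.h>

    /* One-byte "blocks" on a 3-disk RAID-5 stripe: D0, D1, parity P. */
    static unsigned char disk_d0 = 0xAA, disk_d1 = 0x55, disk_p = 0xAA ^ 0x55;

    struct buffer { unsigned char data; int dirty; };
    static struct buffer cache_d0 = { 0xAA, 0 };    /* clean cached D0 */

    int main(void)
    {
        /* Standard 2.2 practice: modify the buffer first, set the dirty
         * bit afterwards.  The race window opens here. */
        cache_d0.data = 0xFF;                       /* b_dirty still 0! */

        /* Meanwhile raid5 writes a new D1.  To avoid re-reading D0 from
         * disk it snoops the cache, sees a "clean" D0, and trusts it: */
        unsigned char new_d1 = 0x66;
        disk_p  = cache_d0.data ^ new_d1;           /* parity from wrong D0 */
        disk_d1 = new_d1;

        cache_d0.dirty = 1;                         /* window closes -- too late */

        /* Crash before the modified D0 is ever written, then disk 0 dies
         * on reboot; reconstruct D0 from parity: */
        unsigned char rebuilt = disk_p ^ disk_d1;   /* 0x99 ^ 0x66 = 0xFF */
        printf("on-disk D0 was 0xAA, rebuilt as 0x%02x -> %s\n",
               rebuilt, rebuilt == 0xAA ? "ok" : "CORRUPT");
        return 0;
    }

Run it and the rebuilt block comes back as 0xFF instead of 0xAA: the parity was computed from a buffer that never matched the disk, so the casualty is a block that was not even being written at the time of the crash.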
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: > > Hi, > > On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha > <[EMAIL PROTECTED]> said: > > >> THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the > >> only way you can get bitten by this failure mode is to have a system > >> failure and a disk failure at the same time. > > > To try to avoid this kind of problem some brands do have additional > > logging (to disk which is slow for sure or to NVRAM) in place, which > > enables them to at least recognize the fault to avoid the > > reconstruction of invalid data or even enables them to recover the > > data by using redundant copies of it in NVRAM + logging information > > what could be written to the disks and what not. > > Absolutely: the only way to avoid it is to make the data+parity updates > atomic, either in NVRAM or via transactions. I'm not aware of any > software RAID solutions which do such logging at the moment: do you know > of any? > AFAIK Veritas only does the first part of what i mentioned above (invalid on disk data recognition). They do logging by default for RAID5 volumes and optionaly also for RAID1 volumes. In the RAID5 (with logging) case they can figure out if an n-1 disk write took place and can rebuild the data. In case an n-m (1 < m < n) took place they can therefore at least recognize the desaster ;-) In the RAID1 (with logging) scenario they are able to recognize, which of the n mirrors have actual data and which ones don't to deliver the actual data to the user and to try to make the other mirrors consistent. But because it's a software solution without any NVRAM support they can't handle the data redundancy case. Heinz
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha <[EMAIL PROTECTED]> said:

> > THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the
> > only way you can get bitten by this failure mode is to have a system
> > failure and a disk failure at the same time.
>
> To try to avoid this kind of problem some brands do have additional
> logging (to disk, which is slow for sure, or to NVRAM) in place, which
> enables them to at least recognize the fault to avoid the
> reconstruction of invalid data, or even enables them to recover the
> data by using redundant copies of it in NVRAM + logging information
> about what could be written to the disks and what not.

Absolutely: the only way to avoid it is to make the data+parity updates atomic, either in NVRAM or via transactions. I'm not aware of any software RAID solutions which do such logging at the moment: do you know of any?

--Stephen
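A transactional stripe update of the kind Stephen alludes to could, in outline, use a write-intent log like the sketch below. This is a generic illustration, not Veritas' or any shipping product's format; every structure and name here is invented.

    #include <stdio.h>

    /* A write-intent log record: made durable before data or parity is
     * touched, marked committed only after both writes land.  A real
     * log would also carry sequence numbers and checksums. */
    struct intent_record {
        int valid, committed;
        unsigned char new_data, new_parity;
    };

    #define NSTRIPE 4
    static unsigned char data_disk[NSTRIPE], parity_disk[NSTRIPE];
    static struct intent_record intent_log[NSTRIPE];

    static void stripe_write(int s, unsigned char d, unsigned char p)
    {
        /* 1. log the intent (an NVRAM or log-disk write in real life) */
        intent_log[s] = (struct intent_record){ 1, 0, d, p };
        /* 2. the two non-atomic device writes may now proceed */
        data_disk[s] = d;
        parity_disk[s] = p;
        /* 3. retire the record: the dangerous window is closed */
        intent_log[s].committed = 1;
    }

    /* After a crash: any valid-but-uncommitted record marks a stripe
     * whose parity cannot be trusted; redo it from the log instead of
     * reconstructing from possibly bogus parity. */
    static void recover(void)
    {
        for (int s = 0; s < NSTRIPE; s++)
            if (intent_log[s].valid && !intent_log[s].committed) {
                data_disk[s] = intent_log[s].new_data;
                parity_disk[s] = intent_log[s].new_parity;
                printf("stripe %d replayed from intent log\n", s);
            }
    }

    int main(void)
    {
        stripe_write(0, 0x11, 0x10);                 /* clean update */
        intent_log[1] = (struct intent_record){ 1, 0, 0x33, 0x30 };
        data_disk[1] = 0x33;   /* data landed, parity did not: crash! */
        recover();             /* replays stripe 1 only               */
        return 0;
    }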
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: > > Hi, > > This is a FAQ: I've answered it several times, but in different places, > THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the > only way you can get bitten by this failure mode is to have a system > failure and a disk failure at the same time. > To try to avoid this kind of problem some brands do have additional logging (to disk which is slow for sure or to NVRAM) in place, which enables them to at least recognize the fault to avoid the reconstruction of invalid data or even enables them to recover the data by using redundant copies of it in NVRAM + logging information what could be written to the disks and what not. Heinz
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 20:17:22 +0100, Benno Senoner <[EMAIL PROTECTED]> said:

> Assume all RAID code - FS interaction problems get fixed; since a
> linux soft-RAID5 box has no battery backup, does this mean that we
> will lose data ONLY if there is a power failure AND a successive disk
> failure? If we lose the power, and after reboot all disks remain
> intact, can the RAID layer reconstruct all information in a safe way?

Yes.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: (...) > > 3) The soft-raid backround rebuild code reads and writes through the >buffer cache with no synchronisation at all with other fs activity. >After a crash, this background rebuild code will kill the >write-ordering attempts of any journalling filesystem. > >This affects both ext3 and reiserfs, under both RAID-1 and RAID-5. > > Interaction 3) needs a bit more work from the raid core to fix, but it's > still not that hard to do. > > So, can any of these problems affect other, non-journaled filesystems > too? Yes, 1) can: throughout the kernel there are places where buffers > are modified before the dirty bits are set. In such places we will > always mark the buffers dirty soon, so the window in which an incorrect > parity can be calculated is _very_ narrow (almost non-existant on > non-SMP machines), and the window in which it will persist on disk is > also very small. > > This is not a problem. It is just another example of a race window > which exists already with _all_ non-battery-backed RAID-5 systems (both > software and hardware): even with perfect parity calculations, it is > simply impossible to guarantee that an entire stipe update on RAID-5 > completes in a single, atomic operation. If you write a single data > block and its parity block to the RAID array, then on an unexpected > reboot you will always have some risk that the parity will have been > written, but not the data. On a reboot, if you lose a disk then you can > reconstruct it incorrectly due to the bogus parity. > > THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the > only way you can get bitten by this failure mode is to have a system > failure and a disk failure at the same time. > > > --Stephen thank you very much for these clear explanations, Last doubt: :-) Assume all RAID code - FS interaction problems get fixed, since a linux soft-RAID5 box has no battery backup, does this mean that we will loose data ONLY if there is a power failure AND successive disk failure ? If we loose the power and then after reboot all disks remain intact can the RAID layer reconstruct all information in a safe way ? The problem is that power outages are unpredictable even in presence of UPSes therefore it is important to have some protection against power losses. regards, Benno.
[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

This is a FAQ: I've answered it several times, but in different places, so here's a definitive answer which will be my last one: future questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner <[EMAIL PROTECTED]> said:

>> then raid can miscalculate parity by assuming that the buffer matches
>> what is on disk, and that can actually cause damage to other data
>> than the data being written if a disk dies and we have to start using
>> parity for that stripe.

> do you know if using soft RAID5 + regular ext2 causes the same sort of
> damage, or if the corruption chances are lower when using a
> non-journaled FS ?

Sort of. See below.

> is the potential corruption caused by the RAID layer or by the FS
> layer ? (does the FS code or the RAID code need to be fixed ?)

It is caused by neither: it is an interaction effect.

> if it's caused by the FS layer, how do XFS (not here yet ;-) or
> ReiserFS behave in this case ?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with respect to write ordering. There is no policy to guide what gets written and when: the writeback caching can trickle to disk at any time, and other system components such as filesystems and the VM can force a write-back of data to disk at any time.

Journaling imposes write-ordering constraints which insist that data in the buffer cache *MUST NOT* be written to disk unless the filesystem explicitly says so.

RAID-5 needs to interact directly with the buffer cache in order to be able to improve performance. There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data in a stripe gets written to disk at once. For RAID-5, this is very much faster than dribbling the stripe back one disk at a time.

   Unfortunately, this can result in dirty buffers being written to disk earlier than the filesystem expected, with the result that on a crash, the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in order to calculate parity without reading all of the disks in a stripe. If a journaling system tries to prevent modified data from being flushed to disk by deferring the setting of the buffer dirty flag, then RAID-5 will think that the buffer, being clean, matches the state of the disk, and so it will calculate parity which doesn't actually match what is on disk. If we crash and one disk fails on reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely avoid buffers which have an incremented b_count reference count, and making sure that the filesystems all hold that count raised when the buffers are in an inconsistent or pinned state.

3) The soft-raid background rebuild code reads and writes through the buffer cache with no synchronisation at all with other fs activity. After a crash, this background rebuild code will kill the write-ordering attempts of any journalling filesystem.

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's still not that hard to do.
So, can any of these problems affect other, non-journaled filesystems too? Yes, 1) can: throughout the kernel there are places where buffers are modified before the dirty bits are set. In such places we will always mark the buffers dirty soon, so the window in which an incorrect parity can be calculated is _very_ narrow (almost non-existent on non-SMP machines), and the window in which it will persist on disk is also very small.

This is not a problem. It is just another example of a race window which exists already with _all_ non-battery-backed RAID-5 systems (both software and hardware): even with perfect parity calculations, it is simply impossible to guarantee that an entire stripe update on RAID-5 completes in a single, atomic operation. If you write a single data block and its parity block to the RAID array, then on an unexpected reboot you will always have some risk that the parity will have been written, but not the data. On a reboot, if you lose a disk then you can reconstruct it incorrectly due to the bogus parity.

THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the only way you can get bitten by this failure mode is to have a system failure and a disk failure at the same time.

--Stephen
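To see why a torn stripe update alone (no buffer-cache races at all) already creates this exposure, here is a small self-contained C example; the one-byte blocks and three-disk layout are invented for illustration.

    #include <stdio.h>

    /* Three-disk RAID-5, one-byte blocks: D0, D1, P = D0 ^ D1. */
    int main(void)
    {
        unsigned char d0 = 0xAA, d1 = 0x55, p = 0xAA ^ 0x55;

        /* Update D1 to 0x77: two device writes, never atomic. */
        unsigned char new_d1 = 0x77;
        p = d0 ^ new_d1;        /* parity write completes ...          */
        /* d1 = new_d1;            ... but power fails before the data
                                    write ever reaches the platter     */

        /* On reboot D0 dies.  Reconstruction from parity gives:       */
        unsigned char rebuilt_d0 = p ^ d1;   /* 0xDD ^ 0x55 = 0x88     */
        printf("D0 was 0x%02x, rebuilt as 0x%02x -> %s\n",
               0xAA, rebuilt_d0,
               rebuilt_d0 == 0xAA ? "ok" : "CORRUPT");
        return 0;
    }

Note that the lost block, D0, was not part of the write at all: no journal can protect it, which is why this window is inherent to any non-battery-backed RAID-5, software or hardware.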