Re: Status of ReiserFS + Journalling

2000-10-17 Thread lamont

On Thu, 5 Oct 2000, Neil Brown wrote:
>  2/ Arrange your filesystem so that you write new data to an otherwise
>unused stripe a whole stripe at a time, and store some sort of
>checksum in the stripe so that corruption can be detected.  This
>implies a log structured filesystem (though possibly you could come
>close enough with a journalling or similar filesystem, I'm not
>sure).

This will hose your performance if you're doing random reads/writes of
small chunks of data.  It's better in that case to have the size that your
app/fs writes match the block size of a single disk, so that you
don't have to seek all the drives to the same cylinder every time you do a
read/write.




Re: Status of ReiserFS + Journalling

2000-10-06 Thread Helge Hafting

Jeremy Fitzhardinge wrote:
> 
> On Thu, Oct 05, 2000 at 11:33:30AM +0200, Helge Hafting wrote:
> > A power failure might leave you with a corrupt disk block.  That is
> > detectable (read failure) and you may then reconstruct it using the
> > rest of the stripe.  This will get you data from either before
> > or after the update was supposed to happen.
> 
> How would you be able to tell which disk contains the bad stripe?
> RAID reconstruction relies on knowing which disk to reconstruct because
> it's obviously bad - there's out of band information in the form
> of I/O errors.  If you only have an incompletely updated stripe on
> a disk, you don't know which data to reconstruct from parity.
> 
Correct.  RAID won't help you if one disk is updated flawlessly
but not the others.  It is a guard against disk breakdown only.

> I think the only way of doing this properly is either to have a
> battery-backed cache, or to do journalling at the RAID level.

Isn't this something a journalling _fs_ is supposed to fix?
You don't really need journalling at the raid level;
the raid should (on a dirty startup) notice the dirtiness and
check every stripe for correct parity.  (Either in a full raid check, or
by using a degraded mode where every stripe is read
completely and checked before access.)

The raid can then inform the fs that the entire stripe is corrupt
when parity is bad, and the fs can fix this by replaying
its journal (or running a fsck).
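
For concreteness, the per-stripe parity check described above amounts to
xor-ing all the blocks of a stripe (data plus parity) together and testing
for zero.  A minimal sketch in C; the buffer layout and names are
illustrative, not the actual md driver code:

#include <stddef.h>

/* Return 1 if a stripe's parity is consistent, 0 if it is not.
 * blocks[0..nblocks-1] are all the blocks of one stripe (data + parity),
 * each 'blocksize' bytes long.  Illustrative sketch only. */
static int stripe_parity_ok(unsigned char **blocks, int nblocks,
                            size_t blocksize)
{
        size_t i;
        int b;

        for (i = 0; i < blocksize; i++) {
                unsigned char x = 0;

                for (b = 0; b < nblocks; b++)
                        x ^= blocks[b][i];
                if (x != 0)
                        return 0;  /* parity mismatch: stripe is suspect */
        }
        return 1;
}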

Helge Hafting



Re: Status of ReiserFS + Journalling

2000-10-06 Thread Neil Brown

On Friday October 6, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > Suppose, for stripe X the parity device is device 1 and we were
> > updating the block on device 0 at the time of system failure.
> > What had happened was that the new parity block was written out, but
> > the new data block wasn't.
> > Suppose further that when the system comes back, device 2 has failed.
> > We now cannot recover the data that was on stripe X, device 2.  If we
> > tried, we would xor all the blocks from working devices together and I
> > hope that you can see that this would be the wrong answer.  This poor,
> > innocent, block, which hasn't been modified for years, has just been
> > corrupted.  Not good for PR.
> 
> Now that I'm getting better at thinking about this I can see that a very
> simple journal will protect from this particular problem.  A phase-tree
> style approach would likely do the job more efficiently, once again. 
> Here's the ultimate simple approach: why not treat an entire stripe as
> one block?  That way you never get 'innocent blocks' on your stripe.

Yyyes... buuut...

There was a detail (one of many, probably) that I skipped in my brief
description of raid5.
Every block on the raid5 array has a two-dimensional address
   (discnumber, blocknumber)
This needs to be mapped into the linear address space expected by a
filesystem (unless you have a clever filesystem that understands
two-dimensional addressing and copes with holes where the parity blocks
are).
Two extremes of ways to do this are:

  abc-   afk-
  de-f   bg-o
  g-hi   c-lp
  -jkl   -hmq
  mno-   din-
  pq-r   ej-r

where letters are logical block addresses, hyphens are parity blocks,
columns are drives, and rows are physical block numbers.

What is typically done is to define a cluster size, and then address
down the drive for a cluster, and then step across to the next drive
for the next cluster, so with a cluster size of 3, the above array
would be

  adg-
  beh-
  cfi-
  jm-p
  kn-q
  lo-r

(notice that the parity blocks come in clusters like the data blocks).
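
The clustered mapping in that second picture can be written down directly.
Here is an illustrative sketch of the logical-block to (drive, physical
block) translation; it matches the diagrams above, not necessarily the
exact layout the Linux md driver uses:

#include <stdio.h>

/* Data runs down each drive for 'cluster' blocks, and the parity drive
 * rotates once per group of 'cluster' rows, as in the pictures above.
 * Drives and physical blocks are counted from 0.  Illustrative only. */
static void map_block(long logical, int ndrives, int cluster,
                      int *drive, long *physical)
{
        long per_group = (long)cluster * (ndrives - 1); /* data blocks/group */
        long group     = logical / per_group;
        long offset    = logical % per_group;
        int  data_idx  = (int)(offset / cluster);   /* which data drive */
        int  parity    = ndrives - 1 - (int)(group % ndrives);

        *drive    = (data_idx < parity) ? data_idx : data_idx + 1;
        *physical = group * cluster + offset % cluster;
}

int main(void)
{
        int d;
        long p;

        /* block 'p' (logical 15) in the cluster-size-3 example above */
        map_block(15, 4, 3, &d, &p);
        printf("drive %d, block %ld\n", d, p);  /* drive 3, block 3 */
        return 0;
}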
There is a trade off when choosing cluster size.
A cluster size of 1 (as in the very first picture above) means that
any sequential access will probably use all drives, and so you should
see appropriate speed-ups for reads, and you might be able to avoid
reading old data for writes (as when you write a whole stripe you
don't need to read old data to calculate parity).
This is good if you have just a single thread accessing the array.

A large cluster size (e.g. 64k) means that most accesses will use only
one drive (for reads) or two drives (for writes - data + parity).  This
means that multiple threads that access the array concurrently will not
always be tripping over each other (sometimes, but not always) (this
tripping is called 'head contention').

There is a formula that I have seen, but cannot remember, which links
typical IO size and typical number of concurrent threads to the ideal
cluster size.

Issues of drive geometry come into this too.  If you are going to read
any of a track, you may as well read all of it.  So having a cluster
size that is a multiple of the track size would be good (if only
drives had constant-sized tracks!).

Back to your idea.  Having each stripe be one filesystem block means
either having large filesystem blocks (meaning lots of wastage) or
having a cluster size of 1.

Unfortunately, with Linux Software RAID, the minimum cluster size is one
page, and the maximum filesystem block size is one page, so we cannot
try this out on Linux to see how it actually works.


My understanding of the way WAFL works is that it uses RAID4, so there are
no parity holes to worry about (RAID4 has all the parity on the one
drive) and WAFL knows about the 2-D structure.
It tries to lay out whole files (or large clusters of each file) onto
one disk each, but hopes to have enough files that need writing at any
one time that it can write them all, one onto each disc, and thus keep
all the discs busy while writing, but still have reduced head contention
when reading.

NeilBrown



Re: Status of ReiserFS + Journalling

2000-10-06 Thread Daniel Phillips

Neil Brown wrote:
> Suppose, for stripe X the parity device is device 1 and we were
> updating the block on device 0 at the time of system failure.
> What had happened was that the new parity block was written out, but
> the new data block wasn't.
> Suppose further that when the system comes back, device 2 has failed.
> We now cannot recover the data that was on stripe X, device 2.  If we
> tried, we would xor all the blocks from working devices together and I
> hope that you can see that this would be the wrong answer.  This poor,
> innocent, block, which hasn't been modified for years, has just been
> corrupted.  Not good for PR.

Now that I'm getting better at thinking about this I can see that a very
simple journal will protect from this particular problem.  A phase-tree
style approach would likely do the job more efficiently, once again.
Here's the ultimate simple approach: why not treat an entire stripe as
one block?  That way you never get 'innocent blocks' on your stripe.

--
Daniel



Re: Status of ReiserFS + Journalling

2000-10-05 Thread Jeremy Fitzhardinge

On Thu, Oct 05, 2000 at 11:33:30AM +0200, Helge Hafting wrote:
> A power failure might leave you with a corrupt disk block.  That is
> detectable (read failure) and you may then reconstruct it using the
> rest of the stripe.  This will get you data from either before 
> or after the update was supposed to happen.

How would you be able to tell which disk contains the bad stripe?
RAID reconstruction relies on knowing which disk to reconstruct because
it's obviously bad - there's out of band information in the form
of I/O errors.  If you only have an incompletely updated stripe on
a disk, you don't know which data to reconstruct from parity.

I think the only way of doing this properly is either to have a
battery-backed cache, or to do journalling at the RAID level.

J

 PGP signature


Re: Status of ReiserFS + Journalling

2000-10-05 Thread Daniel Phillips

Neil Brown wrote:
> 
> For RAID5 a 'stripe' is a set of blocks, one from each underlying
> device, which are all at the same offset within their device.
> For each stripe, one of the blocks is a "parity" block - though it is
> a different block for each stripe (parity is rotated).
> 
> Content of the parity block is computed from the xor of the content of
> all the other (data) blocks.
> 
> To update a data block, you must also update the parity block to keep
> it consistent.  For example, you can read the old parity block, read the old
> data block, compute
>newparity = oldparity xor olddata xor newdata
> and then write out newparity and newdata.
> 
> It is not possible (on current hardware:-) to write both newparity and
> newdata to the different devices atomically.  If the system fails
> (e.g. power failure) between writing one and writing the other, then
> you have an inconsistent stripe.
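
For concreteness, the read-modify-write update quoted above amounts to
the following; a sketch only, with illustrative buffer names:

#include <stddef.h>

/* newparity = oldparity xor olddata xor newdata, byte by byte.
 * newparity and newdata must then go to two different drives, and there
 * is no way to make those two writes atomic - that is the window being
 * discussed here.  Illustrative sketch only. */
static void update_parity(const unsigned char *oldparity,
                          const unsigned char *olddata,
                          const unsigned char *newdata,
                          unsigned char *newparity, size_t blocksize)
{
        size_t i;

        for (i = 0; i < blocksize; i++)
                newparity[i] = oldparity[i] ^ olddata[i] ^ newdata[i];
}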

OK, and not only newdata is corrupted, but n-2 of its unrelated
neighbors on the same stripe.  I see the problem.  I'm also...
beginning... to see... a solution.  Maybe.

[stuff I can't answer intelligently yet snipped]
> > Given a clear statement of the problem, I think I can show how to update
> > the stripes atomically.  At the very least, I'll know what interface
> > Tux2 needs from RAID in order to guarantee an atomic update.
> 
>  From my understanding, there are two ways to approach this problem.
> 
>  1/ store updates to a separate device, either NV ram or a separate
>   disc drive.  Providing you write address/oldvalue/newvalue to the
>   separate device before updating the main array, you could be safe
>   against single device failures combined with system failures.

A journalling filesystem.  Fine.  I'm sure Stephen has put plenty of
thought into this one.  Advantage: it's obvious how it helps the RAID
problem.  Disadvantage: you have the normal finicky journalling boundary
conditions to worry about.  Miscellaneous fact: you will be writing
everything twice (roughly).
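
The address/oldvalue/newvalue record Neil describes could look something
like this; the struct and the 512-byte sector size are illustrative
assumptions, not an existing interface:

#include <stdint.h>

/* Hypothetical journal record for option 1: written to NV ram or a
 * separate disc before the main array is touched, so that a half-done
 * stripe update can be redone or undone after a crash. */
struct raid_journal_rec {
        uint64_t      sector;       /* address within the array */
        unsigned char oldval[512];  /* data before the update */
        unsigned char newval[512];  /* data after the update */
};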

>  2/ Arrange your filesystem so that you write new data to an otherwise
>unused stripe a whole stripe at a time, and store some sort of
>checksum in the stripe so that corruption can be detected.  This
>implies a log structured filesystem (though possibly you could come
>close enough with a journalling or similar filesystem, I'm not
>sure).

I think it's true that Tux2's approach can do many of the things an LFS
can do.  And you can't tell by looking at a block which inode it belongs
to - I think we need to know this.  The obvious fix is to extend the
group metafile with a section that reverse maps each block using a
two-word inode:index pair.  (0.2% extra space with 4K blocks.)
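
The space figure checks out; a sketch of what such a reverse-map entry
could look like, with an illustrative struct name:

#include <stdint.h>

/* Hypothetical reverse-map entry, one per filesystem block: which inode
 * owns the block and at which index within that file.  Two 32-bit words
 * = 8 bytes per 4096-byte block, i.e. 8/4096 ~= 0.2% extra space. */
struct rmap_entry {
        uint32_t inode;  /* owning inode number */
        uint32_t index;  /* block index within that inode */
};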

A nice fact about Tux2 is that the changes in a filesystem from phase to
phase can be completely arbitrary.  (LFS shares this property - it falls
out from doing copy-on-write towards the root of a tree.)  So you can
operate a write-twice algorithm like this: first clean out a number of
partially-populated stripes by branching the in-use blocks off to empty
stripes.  The reverse lookup is used to know which inode blocks to
branch.  You don't have to worry about writing full stripes because Tux2
will automatically revert to a consistent state on interruption.  When
you have enough clear space you cause a phase transition, and now you
have a consistent filesystem with lots of clear stripes into which you
can branch update blocks.

  Even numbered phases: Clear out freespace by branching orphan blocks
  Odd numbered phases: Branch updates into the new freespace
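
A rough sketch of that alternation; every helper here is a hypothetical
stand-in for the real Tux2 machinery, just marking where the work would
go:

/* Hypothetical helpers - not real Tux2 interfaces. */
extern int  stripe_is_partial(long stripe);
extern void branch_live_blocks_to_empty_stripe(long stripe); /* uses rmap */
extern void branch_updates_into_free_stripes(void);
extern void phase_transition(void);  /* Tux2's atomic commit point */

static void run_phase(long phase, long nstripes)
{
        long s;

        if ((phase & 1) == 0) {
                /* Even phase: clear out freespace by relocating live
                 * blocks out of partially populated stripes. */
                for (s = 0; s < nstripes; s++)
                        if (stripe_is_partial(s))
                                branch_live_blocks_to_empty_stripe(s);
        } else {
                /* Odd phase: branch the actual updates into the stripes
                 * freed by the previous phase. */
                branch_updates_into_free_stripes();
        }
        /* If interrupted before this point, the old phase survives. */
        phase_transition();
}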

Notice that what I'm doing up there very closely resembles an
incremental defrag, and it can be tuned to really be a defrag.  This might
be useful.

What is accomplished is that we never kill innocent blocks in the nasty
way you described earlier.

I'm not claiming this is at all efficient - I smell a better algorithm
somewhere in there.  On the other hand, it's about the same efficiency
as journalling, doesn't have so many tricky boundary conditions, and you
get the defrag, something that is a lot harder to do with an
update-in-place scheme.  Correct me if I'm wrong, but don't people
running RAID care more about safety than speed?

This is just the first cut.  I think I have some sort of understanding
of the problem now, however imperfect.  I'll let it sit and percolate
for a while, and now I *must* stop doing this, get some sleep, then try
to prepare some slides for next week in Atlanta :-)

--
Daniel



Re: Status of ReiserFS + Journalling

2000-10-05 Thread Helge Hafting

Vojtech Pavlik wrote:

> Hmm, now that I think about it, this can lead to data corruption
> even more easily ... Imagine a case where a stripe isn't written completely.
> One of the drives (no matter whether it's the xor one or one of the
> others) thus has invalid data.
> 
> Now how do you decide, after boot, which drive of the set, including the
> xor drive, is the one that contains the invalid data? I think this is
> not possible.
> 
A power failure might leave you with a corrupt disk block.  That is
detectable (read failure) and you may then reconstruct it using the
rest of the stripe.  This will get you data from either before 
or after the update was supposed to happen.

There is a requirement for this to work: never ever write to more
than one disk in the same stripe simultaneously.  (You can write
to all drives simultaneously, but a different stripe on each.)

I believe this is hard to achieve with the current implementation, as
raid-5 would have to override the elevator algorithms as well as
any caching internal to the drives.  And performance would probably
not be fantastic.  

A simple raid protects against disk breakdown, not power loss (or
kernel crash.)  There are UPS'es for power loss, and
battery-backed caches for further improvement.

Helge Hafting



Re: Status of ReiserFS + Journalling

2000-10-05 Thread Vojtech Pavlik

On Thu, Oct 05, 2000 at 09:49:29AM +0200, Andi Kleen wrote:
> On Thu, Oct 05, 2000 at 09:39:34AM +0200, Vojtech Pavlik wrote:
> > Hmm, now that I think about it, this can lead to data corruption
> > even more easily ... Imagine a case where a stripe isn't written completely.
> > One of the drives (no matter whether it's the xor one or one of the
> > others) thus has invalid data.
> > 
> > Now how do you decide, after boot, which drive of the set, including the
> > xor drive, is the one that contains the invalid data? I think this is
> > not possible.
> 
> Normally only the parity block and the block actually being changed in the
> stripe are updated, not all blocks in the stripe set.
> 
> When no disk fails, the to-be-changed block may still contain the old value
> after a crash (no worse than the no-RAID case); parity will be fixed up to
> make the RAID consistent again.  The other blocks are not touched.

True, the result is no worse than the normal single disk case.

-- 
Vojtech Pavlik
SuSE Labs



Re: Status of ReiserFS + Journalling

2000-10-05 Thread Andi Kleen

On Thu, Oct 05, 2000 at 01:54:59PM +1100, Neil Brown wrote:
>  2/ Arrange your filesystem so that you write new data to an otherwise
>unused stripe a whole stripe at a time, and store some sort of
>checksum in the stripe so that corruption can be detected.  This
>implies a log structured filesystem (though possibly you could come
>close enough with a journalling or similar filesystem, I'm not
>sure).

You don't need a checksum, I think; an atomically updated fs-block ->
actual-stripe map would be enough. It can only be updated after
you have written the new independent stripe completely.

Simply using ordered writes for it (only write the map after you wrote the
stripe) may be tricky though, because you could get cyclic dependencies in a
single HW map block when the file system allocates many new stripes in
parallel [so you would probably need something like soft updates and
handling of multiple versions of the map in core].
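
With only one new stripe in flight the ordering is straightforward; a
sketch at the block-device level, with illustrative file descriptors and
offsets.  The cyclic-dependency problem above is exactly what this naive
version ignores:

#include <sys/types.h>
#include <unistd.h>

/* Write a complete new stripe, force it to stable storage, and only then
 * update the block -> stripe map entry that points at it. */
static int commit_new_stripe(int fd_stripes, off_t stripe_off,
                             const void *stripe, size_t stripe_len,
                             int fd_map, off_t map_off,
                             const void *map_entry, size_t map_len)
{
        if (pwrite(fd_stripes, stripe, stripe_len, stripe_off) !=
            (ssize_t)stripe_len)
                return -1;
        if (fsync(fd_stripes))          /* stripe must be stable first */
                return -1;
        if (pwrite(fd_map, map_entry, map_len, map_off) !=
            (ssize_t)map_len)
                return -1;
        return fsync(fd_map);           /* map now points at valid data */
}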

Another method when you have a logging fs is to simply log the 
map block change into your normal log.  At least for ext3 and reiserfs
it would be expensive though, because they can only log complete changed
hardware blocks of the map.  JFS or XFS with item logging could do it 
relatively cheaply. 

-Andi



Re: Status of ReiserFS + Journalling

2000-10-04 Thread Daniel Phillips

Andi Kleen wrote:
> On Wed, Oct 04, 2000 at 01:42:46AM -0600, Andreas Dilger wrote:
> > You should ask the reiserfs mailing list for outstanding problems.  As
> > far as LVM is concerned, I don't think there is a problem, but watch out
> > for software RAID 5 and journalling filesystems (reiser or ext3, at least
> > under 2.2) - it can have problems if there is a disk crash.
> 
> It is not inherent to journaling file systems; Linux software RAID 5
> can always corrupt your data when you have a system crash with a disk
> crash (no way to write stripe sets atomically, and half-written stripe sets
> usually give random data for any crashed block in them when xored against parity)

'Atomic' - a word that makes my ears perk up.  Tux2 is all about atomic
updating.  Could you please give a simple statement of the problem for a
person who doesn't know much more about RAID than that it stands for
Redundant Array of Inexpensive Disks (Drives?)

Given a clear statement of the problem, I think I can show how to update
the stripes atomically.  At the very least, I'll know what interface
Tux2 needs from RAID in order to guarantee an atomic update.

> In this case "safe" just means that you don't need a fsck to be sure that
> the metadata is consistent -- data is never guaranteed to be consistent
> unless you have applications that use fsync/O_SYNC properly (=basically do
> their own journaling)

I truly believe that's a temporary situation.

> So overall I would not be worried too much, it isn't much worse with a
> journaled fs than it is without it.

But if it could be better...

--
Daniel



Re: Status of ReiserFS + Journalling

2000-10-04 Thread Andi Kleen

On Wed, Oct 04, 2000 at 01:42:46AM -0600, Andreas Dilger wrote:
> Magnus Naeslund writes:
> > The storage will be exported via ftp, samba, nfs & cvs.
> > I will patch the selected kernel to support LFS and LVM, and the filesystem
> > will run on that.
> > 
> > I am very interested in ReiserFS, and success/failure stories about it.
> 
> You really need to watch out when using ReiserFS as an NFS server.  There
> are patches to NFS in order to run on ReiserFS.  There were also problems
> (they may be fixed now) that force the NFS clients to be Linux only.

I never knew of any problems that didn't affect Linux clients too (and
I did most of the NFS patches).  If you know of any, please let me know.

> 
> You should ask the reiserfs mailing list for outstanding problems.  As
> far as LVM is concerned, I don't think there is a problem, but watch out
> for software RAID 5 and journalling filesystems (reiser or ext3, at least
> under 2.2) - it can have problems if there is a disk crash.

It is not inherent to journaling file systems; Linux software RAID 5
can always corrupt your data when you have a system crash with a disk
crash (no way to write stripe sets atomically, and half-written stripe sets
usually give random data for any crashed block in them when xored against parity)

Given that, 2.4 should be safe.

In this case "safe" just means that you don't need a fsck to be sure that
the metadata is consistent -- data is never guaranteed to be consistent 
unless you have applications that use fsync/O_SYNC properly (=basically do 
their own journaling) 

So overall I would not be worried too much, it isn't much worse with a 
journaled fs than it is without it.



-Andi



Re: Status of ReiserFS + Journalling

2000-10-04 Thread Andreas Dilger

Magnus Naeslund writes:
> The storage will be exported via ftp, samba, nfs & cvs.
> I will patch the selected kernel to support LFS and LVM, and the filesystem
> will run on that.
> 
> I am very interested in ReiserFS, and success/failure stories about it.

You really need to watch out when using ReiserFS as an NFS server.  There
are patches to NFS in order to run on ReiserFS.  There were also problems
(they may be fixed now) that force the NFS clients to be Linux only.

You should ask the reiserfs mailing list for outstanding problems.  As
far as LVM is concerned, I don't think there is a problem, but watch out
for software RAID 5 and journalling filesystems (reiser or ext3, at least
under 2.2) - it can have problems if there is a disk crash.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert


