Re: Journaling FS and RAID

2000-06-28 Thread Stephen C. Tweedie

Hi,

On Wed, Jun 28, 2000 at 06:35:51PM +0200, Benno Senoner wrote:

  As far as I know the issue has been fixed in the 2.4.* kernel series.
  ReiserFS on software RAID5 is NOT safe in 2.2.*
 
 but Stephen Tweedie (some time ago) pointed out that
 the only way to make a software raid system that survives (without data corruption)
 a power failure
 while in degraded mode (this case is rare but it COULD happen)
 is to make a big RAID5 partition where you store the data and a small RAID1
 partition where
 you keep the journal of the RAID5 partition.

The real situation is a little more complex than that.  In degraded
mode, or if you lose a disk during a crash, ALL raid5 systems ---
hardware and software --- risk data loss unless they have some
transactional mechanism to allow them to write entire stripes
atomically with respect to power failure.

In practice, this is usually achieved (for hardware raid) by logging
the stripe updates to non-volatile memory.  (This is usually the same
memory that is used for the write-back cache, so it gives a natural
performance boost as well.)  Using a separate raid1 journal is
possible, but would be an odd way to deal with the problem given that
we're talking at the level of individual raid devices here.

For journaling *filesystems*, having the journal on an external raid1
disk is a great way to boost performance, but that doesn't fix the
raid5 problem above.

 He said ext3fs can be adapted for this; what is the current status?

No I didn't!  I said that ext3 can in principle use off-disk journals,
but that is an entirely separate problem from the raid5 consistency
issue.  Making raid5 totally safe while in degraded mode *must*
require the cooperation of the raid layer itself --- it simply cannot
be done in the filesystem unless the filesystem guarantees 100% that
it only ever writes complete stripes at a time.

There are a number of ways this could be done --- in particular, there
have been a few projects recently (SWARM, Lustre) which would lend
themselves to this sort of operation, by layering the filesystem
on top of a log-based storage abstraction which could have the above
protection built in.

 last question: are the current ext3 and reiserfs raid-reconstruction safe?

On 2.4, they should be --- the new raid code performs reconstruction
in a way which is invisible to the buffer cache layers.  Testers
welcome.  :-)

Cheers,
 Stephen



Re: fs-devel URL

2000-04-01 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 30, 2000 at 11:13:13PM +0200, Thomas Kotzian wrote:
 There was a discussion about LVM, reiserfs,... , and i need the URL or the address
 for the mailinglist for fs-devel the File-system development group.

[EMAIL PROTECTED]

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?

2000-01-16 Thread Stephen C. Tweedie

Hi,

Chris Wedgwood writes:

   This may affect data which was not being written at the time of the
   crash.  Only raid 5 is affected.
  
  Long term -- if you journal to something outside the RAID5 array (ie.
  to raid-1 protected log disks) then you should be safe against this
  type of failure?

Indeed.  The jfs journaling layer in ext3 is a completely generic
block device journaling layer which could be used for such a purpose
(and raid/LVM journaling is one of the reasons it was designed this
way).

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?

2000-01-15 Thread Stephen C. Tweedie

Hi,

Benno Senoner writes:

  wow, really good idea to journal to a RAID1 array!

  do you think it is possible to do the following:
  
  - N disks holding a soft RAID5  array.
  - reserve a small partition on at least 2 disks of the array to hold a RAID1
  array.
  - keep the journal on this partition.

Yes.  My jfs code will eventually support this.  The main thing it is
missing right now is the ability to journal multiple devices to a
single journal: the on-disk structure is already designed with that in
mind but the code does not yet support it.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power fai

2000-01-13 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 11:28:28 MET-1, "Petr Vandrovec"
[EMAIL PROTECTED] said:

   I did not follow this thread (on -fsdevel) too closely (and I never
 looked into RAID code, so I should shut up), but... can you
 confirm that after a buffer with data is finally marked dirty, parity
 is recomputed anyway? So that window is really small, and the same problem
 occurs every time data has been written but parity has not yet been written?

Yes, that's what I said.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?

2000-01-13 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 22:09:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 Sorry for my ignorance, I got a little confused by this post:

 Ingo said we are 100% journal-safe; you said the contrary,

Raid resync is safe in the presence of journaling.  Journaling is not
safe in the presence of raid resync.

 can you or Ingo please explain to us in which situation (power loss)
 running linux-raid + a journaled FS we risk a corrupted filesystem?

Please read my previous reply on the subject (the one that started off
with "I'm tired of answering the same question a million times so here's
a definitive answer").  Basically, there will always be a small risk of
data loss if power-down is accompanied by loss of a disk (it's a
double-failure); and the current implementation of raid resync means
that journaling will be broken by the raid1 or raid5 resync code after a
reboot on a journaled filesystem (ext3 is likely to panic, reiserfs will
not but will still get its IO ordering requirements messed up by the
resync). 

 After the reboot, if all disks remain physically intact, will we only
 lose the data that was being written, or is there a possibility of ending
 up with a corrupted filesystem which could cause more damage in future?

In the power+disk failure case, there is a very narrow window in which
parity may be incorrect, so loss of the disk may result in inability to
correctly restore the lost data.  This may affect data which was not
being written at the time of the crash.  Only raid 5 is affected.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 00:12:55 +0200 (IST), Gadi Oxman
[EMAIL PROTECTED] said:

 Stephen, I'm afraid that there are some misconceptions about the
 RAID-5 code.

I don't think so --- I've been through this with Ingo --- but I
appreciate your feedback since I'm getting inconsistent advice here!
Please let me explain...

 In an early pre-release version of the RAID code (more than two years
 ago?), which didn't protect against that race, we indeed saw locked
 buffers changing under us from the point in which we computed the
 parity till the point in which they were actually written to the disk,
 leading to a corrupted parity.

That is not the race.  The race has nothing at all to do with buffers
changing while they are being used for parity: that's a different
problem, long ago fixed by copying the buffers.

The race I'm concerned about could occur when the raid driver wants to
compute parity for a stripe and finds some of the blocks are present,
and clean, in the buffer cache.  Raid assumes that those buffers
represent what is on disk, naturally enough.  So, it uses them to
calculate parity without rereading all of the disk blocks in the stripe.

The trouble is that the standard practice in the kernel, when modifying
a buffer, is to make the change and _then_ mark the buffer dirty.  If
you hit that window, then the raid driver will find a buffer which
doesn't match what is on disk, and will compute parity from that buffer
rather than from the on-disk contents.
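
As a purely hypothetical illustration of that window (the toy type and helper
below are made up, not 2.2 buffer-cache symbols):

    #include <string.h>

    struct buffer { char data[1024]; int dirty; };  /* toy stand-in, not buffer_head */

    void fs_update_block(struct buffer *bh)
    {
        memcpy(bh->data, "new contents", 13);  /* buffer now differs from disk */

        /*
         * WINDOW: the buffer is still flagged clean but no longer matches
         * the on-disk block.  A raid5 parity calculation that trusts
         * "clean == same as disk" here computes parity from contents
         * that were never written out.
         */

        bh->dirty = 1;                         /* only now queued for writeback */
    }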

 1. n dirty blocks are scheduled for a stripe write.

That's not the race.  The problem occurs when only one single dirty
block is scheduled for a write, and we need to find the contents of the
rest of the stripe to compute parity.

 Point (2) is also incorrect; we have taken care *not* to peek into
 the buffer cache to find clean buffers and use them for parity
 calculations. We make no such assumptions.

Not according to Ingo --- can we get a definitive answer on this,
please?

Many thanks,
  Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 16:41:55 -0600, "Mark Ferrell"
[EMAIL PROTECTED] said:

   Perhaps I am confused.  How is it that a power outage while attached
 to the UPS becomes "unpredictable"?  

One of the most common ways to get an outage while on a UPS is somebody
tripping over, or otherwise removing, the cable between the UPS and the
computer.  How exactly is that predictable?

Just because you reduce the risk of unexpected power outage doesn't mean
we can ignore the possibility.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 07:21:17 -0500 (EST), Ingo Molnar [EMAIL PROTECTED]
said:

 On Wed, 12 Jan 2000, Gadi Oxman wrote:

 As far as I know, we took care not to poke into the buffer cache to
 find clean buffers -- in raid5.c, the only code which does a find_buffer()
 is:

 yep, this is still the case.

OK, that's good to know.

 Especially the reconstruction code is a rathole. Unfortunately
 blocking reconstruction if b_count == 0 is not acceptable because
 several filesystems (such as ext2fs) keep metadata caches around
 (eg. the block group descriptors in the ext2fs case) which have
 b_count == 1 for a longer time.

That's not a problem: we don't need reconstruction to interact with the
buffer cache at all.

Ideally, what I'd like to see the reconstruction code do is to:

* lock a stripe
* read a new copy of that stripe locally
* recalc parity and write back whatever disks are necessary for the stripe
* unlock the stripe

so that the data never goes through the buffer cache at all, but that
the stripe is locked with respect to other IOs going on below the level
of ll_rw_block (remember there may be IOs coming in to ll_rw_block which
are not from the buffer cache, eg. swap or journal IOs).
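
In code, the resync loop I have in mind would look something like the sketch
below (the helpers are invented names, not existing md functions; this is the
shape of the idea, not an implementation):

    struct stripe_buf;                              /* private per-stripe copy */
    struct raid5_dev { unsigned long nr_stripes; };

    /* Invented helpers -- stand-ins for whatever the md core would provide. */
    void lock_stripe(struct raid5_dev *dev, unsigned long s);
    void unlock_stripe(struct raid5_dev *dev, unsigned long s);
    struct stripe_buf *read_stripe_private(struct raid5_dev *dev, unsigned long s);
    void recompute_parity(struct stripe_buf *sb);
    void write_back_changed_disks(struct raid5_dev *dev, unsigned long s,
                                  struct stripe_buf *sb);
    void free_stripe_private(struct stripe_buf *sb);

    void resync_device(struct raid5_dev *dev)
    {
        for (unsigned long s = 0; s < dev->nr_stripes; s++) {
            lock_stripe(dev, s);            /* serialise all other I/O on stripe s */
            struct stripe_buf *sb = read_stripe_private(dev, s);  /* no buffer cache */
            recompute_parity(sb);
            write_back_changed_disks(dev, s, sb);
            unlock_stripe(dev, s);
            free_stripe_private(sb);
        }
    }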

 We are '100% journal-safe' if power fails during resync. 

Except for the fact that resync isn't remotely journal-safe in the first
place, yes.  :-)

--Stephen



[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

This is a FAQ: I've answered it several times, but in different places,
so here's a definitive answer which will be my last one: future
questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 then raid can miscalculate parity by assuming that the buffer matches
 what is on disk, and that can actually cause damage to other data
 than the data being written if a disk dies and we have to start using
 parity for that stripe.

 do you know if using soft RAID5 + regular ext2 causes the same sort of
 damage, or if the corruption chances are lower when using a
 non-journaled FS?

Sort of.  See below.

 is the potential corruption caused by the RAID layer or by the FS
 layer?  (Does the FS code or the RAID code need to be fixed?)

It is caused by neither: it is an interaction effect.

 if it's caused by the FS layer, how do XFS (not here yet ;-) )
 or ReiserFS behave in this case?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering.  There is no policy to guide what gets
written and when: the writeback caching can trickle to disk at any time,
and other system components such as filesystems and the VM can force a
write-back of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in
the buffer cache *MUST NOT* be written to disk unless the filesystem
explicitly says so.

RAID-5 needs to interact directly with the buffer cache in order to be
able to improve performance.

There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data
   in a stripe gets written to disk at once.  For RAID-5, this is very
   much faster than dribbling the stripe back one disk at a time.
   Unfortunately, this can result in dirty buffers being written to disk
   earlier than the filesystem expected, with the result that on a
   crash, the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
   order to calculate parity without reading all of the disks in a
   stripe.  If a journaling system tries to prevent modified data from
   being flushed to disk by deferring the setting of the buffer dirty
   flag, then RAID-5 will think that the buffer, being clean, matches
   the state of the disk and so it will calculate parity which doesn't
   actually match what is on disk.  If we crash and one disk fails on
   reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have an incremented b_count reference count, and
making sure that the filesystems all hold that count raised when the
buffers are in an inconsistent or pinned state.
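
For illustration, the convention could look roughly like this (toy types and
invented names, not the actual buffer-cache or raid5.c interfaces):

    struct buf {
        void *data;
        int   count;        /* reference count: >0 means someone is holding it */
        int   dirty;
    };

    /* raid5 side: only trust a cached block for parity if nobody holds it. */
    const void *contents_for_parity(const struct buf *cached,
                                    const void *disk_block)
    {
        if (cached && cached->count == 0)
            return cached->data;  /* clean and unreferenced: assume it matches disk */
        return disk_block;        /* pinned or absent: reread from the disk */
    }

    /* filesystem side: hold the count raised while the buffer is pinned or
     * in an inconsistent state, so raid5 leaves it alone (this covers both
     * the early-write case and the parity case). */
    void fs_pin(struct buf *b)   { b->count++; }
    void fs_unpin(struct buf *b) { b->count--; }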

3) The soft-raid background rebuild code reads and writes through the
   buffer cache with no synchronisation at all with other fs activity.
   After a crash, this background rebuild code will kill the
   write-ordering attempts of any journalling filesystem.  

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's
still not that hard to do.


So, can any of these problems affect other, non-journaled filesystems
too?  Yes, 2) can: throughout the kernel there are places where buffers
are modified before the dirty bits are set.  In such places we will
always mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existent on
non-SMP machines), and the window in which it will persist on disk is
also very small.

This is not a problem.  It is just another example of a race window
which exists already with _all_ non-battery-backed RAID-5 systems (both
software and hardware): even with perfect parity calculations, it is
simply impossible to guarantee that an entire stripe update on RAID-5
completes in a single, atomic operation.  If you write a single data
block and its parity block to the RAID array, then on an unexpected
reboot you will always have some risk that the parity will have been
written, but not the data.  On a reboot, if you lose a disk then you can
reconstruct it incorrectly due to the bogus parity.
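
A one-byte worked example (hypothetical values) of how stale parity damages a
block that was never even written:

    #include <assert.h>

    int main(void)
    {
        unsigned char d0 = 0x5a, d1 = 0x3c;      /* 3-disk stripe: D0, D1, P    */
        unsigned char p  = d0 ^ d1;              /* 0x66 -- stripe is consistent */

        /* Crash window: the new D0 reaches disk, the matching parity does not. */
        unsigned char d0_new = 0xf0;             /* on disk now: d0_new, d1, p   */

        /* Disk holding D1 dies on reboot; rebuild D1 from D0 and parity. */
        unsigned char d1_rebuilt = d0_new ^ p;   /* 0x96, not 0x3c               */
        assert(d1_rebuilt != d1);                /* untouched data comes back wrong */
        return 0;
    }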

THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
only way you can get bitten by this failure mode is to have a system
failure and a disk failure at the same time.


--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha
[EMAIL PROTECTED] said:

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.

 To try to avoid this kind of problem, some brands do have additional
 logging in place (to disk, which is slow for sure, or to NVRAM), which
 enables them to at least recognize the fault and avoid the
 reconstruction of invalid data, or even enables them to recover the
 data by using redundant copies of it in NVRAM plus logging information
 about what could be written to the disks and what could not.

Absolutely: the only way to avoid it is to make the data+parity updates
atomic, either in NVRAM or via transactions.  I'm not aware of any
software RAID solutions which do such logging at the moment: do you know
of any?

--Stephen



Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-07 Thread Stephen C. Tweedie

Hi,

On Fri, 07 Jan 2000 13:26:21 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 what happens when I run RAID5 + a journaled FS and the box is just writing
 data to the disk and then a power outage occurs?

 Will this lead to a corrupted filesystem, or will only the data which
 was just being written be lost?

It's more complex than that.  Right now, without any other changes, the
main danger is that the raid code can sometimes lead to the filesystem's
updates being sent to disk in the wrong order, so that on reboot, the
journaling corrupts things unpredictably and silently.

There is a second effect, which is that if the journaling code tries to
prevent a buffer being written early by keeping its dirty bit clear,
then raid can miscalculate parity by assuming that the buffer matches
what is on disk, and that can actually cause damage to other data than
the data being written if a disk dies and we have to start using parity
for that stripe.

Both are fixable, but for now, be careful...

--Stephen



Re: Best way to set up swap for high availability?

1999-12-06 Thread Stephen C. Tweedie

Hi,

On Fri, 26 Nov 1999 18:04:27 +0100, Martin Bene [EMAIL PROTECTED] said:

 At 11:35 25.11.99 +0100, Thomas Waldmann wrote:
 What's more interesting for me: how about swap on RAID-5?

 Personally, I've only used raid1, but I can give you a quote from Ingo - and
 he should know:

 At 14:49 14.04.99 +0200, Ingo Molnar wrote:
 Hmm? Since when does swapping work on raid-1? How about raid-5?
 
 i've tested it on RAID5, swapping madly to a RAID5 array while parity is
 being reconstructed works just fine.

Sorry, but since then we did find a fault.  Raid resync goes through the
buffer cache.  Swap bypasses the buffer cache.  There is no coherency
between the two activities.  It is possible for raid1 and raid5
background resync to corrupt swap writes to the partition during
reconstruction.

We need to fix this anyway, since the same problem bites journaling. 

--Stephen



Re: Best way to set up swap for high availability?

1999-12-06 Thread Stephen C. Tweedie

Hi,

On Mon, 6 Dec 1999 16:11:14 -0500 (EST), Andy Poling
[EMAIL PROTECTED] said:

 On Mon, Dec 06, 1999 at 02:53:22PM +, Stephen C. Tweedie wrote:
 Sorry, but since then we did find a fault.  Raid resync goes through the
 buffer cache.  Swap bypasses the buffer cache.  There is no coherency
 between the two activities.  It is possible for raid1 and raid5
 background resync to corrupt swap writes to the partition during
 reconstruction.

 Stephen, does this also hold true if one is swapping to files that happen to
 be located on a software raid partition?

Yes, 'fraid so.

--Stephen



Re: Best way to set up swap for high availability?

1999-12-06 Thread Stephen C. Tweedie

Hi,

On Mon, 6 Dec 1999 20:17:12 +0100, Luca Berra [EMAIL PROTECTED] said:

 do you mean that the problem arises ONLY when a disk fails and has to
 be reconstructed?  

No, it can happen any time the kernel does a resync after an unclean
shutdown.

--Stephen



RE: Bad rawio/raid performance

1999-10-26 Thread Stephen C. Tweedie

Hi,


On Tue, 19 Oct 1999 20:12:20 -0700, "Tom Livingston" [EMAIL PROTECTED]
said:

 Has anyone else tried raw-io with md devices?  It works for me but the
 performance is quite bad.

 This is a recently reported issue on the linux-kernel mailing list.
 The gist of it is that rawio is using a 512-byte blocksize, where raid
 assumes a 1024. This was only first reported a couple of days ago
 (10/16)

Yep.  It's not clear just yet exactly how best to fix this --- the
hacked patch which forces the raw IO blocksize to 1024 will break
applications which (legitimately) expect to be able to perform 512-byte
IOs on the raw device.  I'll let people know once we've figured out how
to get 512-byte IOs working on raid decently.

--Stephen



RE: Bad rawio/raid performance

1999-10-26 Thread Stephen C. Tweedie

Hi,

On Tue, 26 Oct 1999 11:42:41 -0400 (EDT), David Holl
[EMAIL PROTECTED] said:

 would specifying differing input and output block sizes with dd help?

Unfortunately not, no.  The underlying device blocksize is set when the
device is first opened.

--Stephen



Re: (reiserfs) Re: 71% full raid - no space left on device

1999-10-21 Thread Stephen C. Tweedie

Hi,

On Wed, 20 Oct 1999 13:12:23 +0400, Hans Reiser [EMAIL PROTECTED]
said:

 We don't have inodes in our FS, but we do have stat data, and that
 is dynamically allocated (dynamic per FS, not per file yet; soon, but
 not yet, each field will be optional and inheritable per file).

 Does XFS dynamically allocate?  It might.

I believe so, yes.  They do have traditional-looking block groups, but
within each group they can allocate blocks arbitrarily to hold inode
data.  (This is just from memory of their description at the Darmstadt
workshop.)

--Stephen



Re: networked RAID-1

1999-10-12 Thread Stephen C. Tweedie

Hi,

On Mon, 11 Oct 1999 17:02:27 -0500, Stephen Waters [EMAIL PROTECTED]
said:

 This blurb in the latest Kernel Traffic has some status information on
 ext3 and ACLs that might be relevant. 12-18mo for a really stable
 version, but version 0.02 is supposed (maybe already) to be out very
 soon.

If by ext3 you mean journaling, I'm expecting 6 months for a really
stable version, and I expect to see people deploying it in anger within
3.  I already have it on all my laptop filesystems, for example (after a
couple of bugfixes while I was on the road last week).  New release
today.

--Stephen



Re: networked RAID-1

1999-10-11 Thread Stephen C. Tweedie

Hi,

On Thu, 7 Oct 1999 01:59:31 -0500, [EMAIL PROTECTED]
(G.W. Wettstein) said:

 If this works, you can also add a third machine and make a threefold
 raid1 for added HA. Curious myself if this would work. Unfortunately
 cannot test this myself.

 This strategy for doing HA has interested us as well.  Just a few
 comments:

 First of all the current NBD implementation, at least the pieces of it
 that we have been able to find, is not sufficiently robust to
 implement this strategy in a production environment.  

There are at least two teams working on beefing up NBD, including the
addition of proper connection resync and a kernel-based server.

A much more thorny problem is the management of network breaks between
the two disks --- you have to deal with all the clustering issues
surrounding quorum management or you end up with both disks failing the
other side over to themselves and you can't resolve the conflict
afterwards.

--Stephen



Re: networked RAID-1

1999-10-11 Thread Stephen C. Tweedie

Hi,

On Mon, 11 Oct 1999 13:55:23 -0400, Tom Kunz [EMAIL PROTECTED] said:

 Stephen (and others who might know),
   Are there homepages and/or mailing lists for these teams?  I would be
 highly interested in participating...

One is the GFS team at http://gfs.lcse.umn.edu/.  The other hasn't
announced publicly yet.

--Stephen



Re: networked RAID-1

1999-10-11 Thread Stephen C. Tweedie

Hi,

On Mon, 11 Oct 1999 16:58:46 -0400, Tom Kunz [EMAIL PROTECTED] said:

   Hmm, well GFS isn't exactly an improvement on NBD, it's more like an
 entirely different filesystem type.  

GFS is a shared disk filesystem.  It doesn't care how the disk is
shared, and one of the side projects they have taken on is to extend nbd
to provide a level of functionality at which they could run GFS over
nbd.  The resulting gnbd code is on the GFS cvs repository afaik: I can
look out and post the gnbd announcement if you like.

 I was talking with Simon Horman of VA-Research at Internet World in
 NYC this past week, and he feels that it'll be 12 to 18 months until
 we have ext3

ext3 should be usable by Christmas/new year.

 and/or some other kind of nicely-working, network-distributed
 filesystem (such as GFS).  

InterMezzo will be there _much_ sooner by all accounts.  It has already
been demonstrated under serious load, and Peter is spending a lot of
time on it right now.  InterMezzo is a more loosely coupled filesystem
than GFS, but should be perfect for jobs which do not require shared
write access to single files.  See http://www.inter-mezzo.org/.  It's
exciting stuff. :)

Cheers,
 Stephen



Re: raid0 and raw io

1999-08-18 Thread Stephen C. Tweedie

Hi,

On Thu, 29 Jul 1999 09:38:20 -0700, Carlos Hwa [EMAIL PROTECTED]
said:

 I have a 2 disk raid0 with 32k chunk size using raidtools 0.90 beta10
 right now, and have applied stephen tweedie's raw i/o patch. the raw io
 patch works fine with a single disk but if i try to use raw io on
 /dev/md0 for some reason transfer sizes are only 512bytes according to
 the scsi analyzer, no matter what i specify (i am using lmdd from
 lmbench to test, lmdd if=/dev/zero of=/dev/raw1 bs=65536 count=2048,
 /dev/raw1 is the raw device for /dev/md0). Mr. tweedie says it should
 work correctly, so could this be a limitation with the linux raid
 software? Thanks.

I'm back from holiday, so...

Ingo, any thoughts on this?  The raw IO code is basically just stringing
together temporary buffer_heads and then submitting them all, as a
single call, to ll_rw_block (up to a limit of 128 sectors per call).
The IOs are ordered, so attempt_merge() should be happy enough about
merging.  The only thing I can think of which is somewhat unusual about
the IOs is that the device's blocksize is unconditionally set to 512
bytes beforehand: will that confuse md's block merging?
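
For reference, the submission pattern is roughly the following (schematic only;
the types and the stand-in for ll_rw_block are invented, this is not the actual
raw driver code):

    #define SECTOR_SIZE   512
    #define MAX_PER_CALL  128                  /* sectors submitted per batch */

    struct fake_bh { unsigned long sector; char *data; };  /* toy buffer_head */

    /* Stand-in for ll_rw_block(): submits nr ordered requests in one call. */
    void submit_batch(int rw, int nr, struct fake_bh *bhs[]);

    void raw_io(int rw, char *buf, unsigned long start, unsigned long nr_sectors)
    {
        struct fake_bh bh[MAX_PER_CALL], *bhp[MAX_PER_CALL];

        while (nr_sectors) {
            int n = nr_sectors < MAX_PER_CALL ? (int)nr_sectors : MAX_PER_CALL;
            for (int i = 0; i < n; i++) {
                bh[i].sector = start + i;             /* consecutive sectors, so */
                bh[i].data   = buf + i * SECTOR_SIZE; /* merging should be easy  */
                bhp[i] = &bh[i];
            }
            submit_batch(rw, n, bhp);                 /* one call per batch */
            start      += n;
            buf        += (unsigned long)n * SECTOR_SIZE;
            nr_sectors -= n;
        }
    }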

--Stephen



Re: A couple of... pearls?

1999-04-26 Thread Stephen C. Tweedie

Hi,

On Sat, 24 Apr 1999 21:09:05 +0200 (MEST), Francisco Jose Montilla
[EMAIL PROTECTED] said:

   Hi, I happened to come across a couple of statements that somewhat
 involve the use of RAID, statements that I believe are not absolutely
 correct, if not outright false or half-truths.

 ---
 [...]
 Keep in mind that 99 percent of PC hardware is garbage. A friend of mine
 was a small-time Internet service provider. He was running BSDI, a
 not-quite-free Unix, on a bunch of PC clones. A hard disk was generating
 errors. He reloaded from backup tape. He still got errors. It turned out
 that his SCSI controller had gone bad some weeks before. It had corrupted
 both the hard disk and the backup tapes. He lost all of his data.  He lost
 all of his clients' data.

  Lesson 1: You are less likely to lose with a SCSI controller designed
 by a real engineer in the Hewlett-Packard Unix workstation division than
 you are with one thrown in on a $49 sound card.

  Lesson 2: Mirrored disks on separate SCSI chains. Period. 
 

No.  Lesson number zero: check the consistency of your backups.
Regularly.

   I know the HP part is gonna make Dietmar's delights :). Apart from
 that, I wonder:

   - Don't SCSI controllers use parity? (Although you have to
 enable it, of course)

Yes, if the controller supports it, and all modern controllers do.  I
don't even think any of our drivers let you disable it any more.

However, most of the cheapo sound-card-based scsi controllers (which
were first designed as a cheap way of interfacing to a cdrom) don't do
parity.  Run raid on that?  Yeah, right...

   - I agree that using two *controllers* (not two channels on the same
 controller) gives appropriate redundancy if one of them goes mad, but
 nonetheless, although we use only one, shouldn't data corruption be
 detected by the controller parity?

No.  Errors generated on the cable will be detected.  Bus/memory errors
will not; soft errors in the controller will not; and errors in the disk
itself will not.

 One step further, how will the soft RAID code handle this? Does it
 have some heuristics to detect that, or is it completely the task of the
 controller and impossible for soft RAID to detect?

If the IO completes with the status "OK, all IO finished fine", the RAID
code believes it.

 --
 Why would I want a two channel RAID card for RAID one? 

 By putting each harddrive on a separate channel, you can ensure that even
 if a cable or terminator on one channel were to go bad, the system would
 continue to function.

 When hot-swapping a harddrive, the RAID card must temporarily stop the
 SCSI channel the drive is attached to. If the other drive in a RAID one
 array is connected to a different channel, the computer can operate
 completely normally during the hot-swap.
  

   I agree completely with the first statement. But the second sounds
 somewhat odd to me. I can hot-add or hot-remove a disk on linux with sw RAID
 and a controller that isn't hot-swap capable; maybe this is another feature
 of sw RAID over hw RAID? 

You can try, but if the bus is active while you do it, chances are
you'll corrupt data.  There _are_ specially designed raid cabinets
which electrically isolate the bus so that you can do this safely, but
that's not the case for your typical scsi bus.

--Stephen



Re: Benchmarks/Performance.

1999-04-26 Thread Stephen C. Tweedie

Hi,

On Thu, 22 Apr 1999 20:45:52 +0100 (IST), Paul Jakma [EMAIL PROTECTED]
said:

 i tried this with raid0, and if bonnie is any guide, the optimal
 configuration is 64k chunk size, 4k e2fs block size.  

Going much above 64k will mean that readahead has to work very much
harder to keep all the pipelines full when doing large sequential IOs.
That's why bonnie results can fall off.  However, if you have
independent IOs going on (web/news/mail service or multiuser machines)
then that concurrent activity may still be faster with larger chunk
sizes, as you minimise the chance of any one file access having to cross
multiple disks.

In other words, all benchmarks lie. :)

--Stephen



Re: Benchmarks/Performance.

1999-04-26 Thread Stephen C. Tweedie

Hi,

On Mon, 26 Apr 1999 21:28:20 +0100 (IST), Paul Jakma [EMAIL PROTECTED]
said:

 it was close between 32k and 64k. 128k was noticeably slower (for
 bonnie) so i didn't bother with 256k. 

Fine, but 128k will be noticeably faster for some other tasks.  Like I
said, it depends on whether you prioritise large-file bandwidth over the
ability to serve many IOs at once.

 viz pipelining: would i be right in thinking that a decent scsi
 controller and drives can "pipeline" /far/ better than, eg, a udma
 setup?

Yes, although you eventually run into a different bottleneck: the
filesystem has to serialise every so often while reading its indirection
metadata blocks.  Using a 4k fs blocksize helps there (again, for
squeezing the last few %age points out of sequential readahead).

 ie the optimal chunk size would be higher for a scsi system than for
 an eide/udma setup?

udma can do readahead and multi-sector IOs.  scsi can have limited
tagged queue depths.  Command setup is more expensive on scsi than on
ide.  Which costs dominate really depends on the workload.

--Stephen



Re: So, it's up -- and I'm beating it, now about that boot..

1999-04-20 Thread Stephen C. Tweedie

Hi,

On Sat, 17 Apr 1999 16:22:59 -0400 (EDT), "m. allan noah"
[EMAIL PROTECTED] said:

 have you ACTUALLY used grub to boot off of raid1? i don't see how grub is
 capable. it would have to be able to read the md device. prove me wrong
 please.

raid-1 has the property that the raid superblock is at the end of the
partition, so that the filesystem contained inside the raid starts at
the start of each of the component raid partitions.  In other words,
each partition in the raid set looks like a perfectly formed ext2fs
filesystem which just happens to be 64k smaller than the total partition
size.

Grub should be able to read this just fine.

Cheers,
 Stephen



Re: Swap on raid

1999-04-15 Thread Stephen C. Tweedie

Hi,

On 15 Apr 1999 00:13:48 -, [EMAIL PROTECTED] said:

 AFAIK, the swap code uses raw file blocks on disk, rather than passing
 through to vfs, because you don't want to cache swap accesses, think
 about it :)

Sort of correct.  It does bypass most of the VFS, but it does use the
standard block device IO routines.

 this is how swap can work on a partition or a file, because at swapon
 time, the blocks are mapped for direct access.

No, for files, we do the mapping on demand, not all at once on swapon. 

 swap running on raid then, if it works at all, is not actually
 protecting you.  

Yes it is.  Swapping is not done inside the VFS, but neither is RAID.
RAID works under the hood of the block device IO routines
(drivers/block/ll_rw_block.c), so both VFS and swap will take full
advantage of any RAID devices being used.

--Stephen



RE: Swap on raid

1999-04-15 Thread Stephen C. Tweedie

Hi,

On Wed, 14 Apr 1999 15:32:40 -0400, "Joe Garcia" [EMAIL PROTECTED] said:

 Swapping to a file should work, but if I remember correctly you get
 horrible performance.

Swap-file performance on 2.2 kernels is _much_ better.

--Stephen



Re: Swap on raid

1999-04-15 Thread Stephen C. Tweedie

Hi,

On Wed, 14 Apr 1999 21:59:49 +0100 (BST), A James Lewis [EMAIL PROTECTED]
said:

 It wasn't a month ago that this was not possible because it needed to
 allocate memory for the raid and couldn't because it needed to swap to
 do it?  Was I imagining this or have you guys been working too hard!

There may well have been a few possible deadlocks, but the current
kswapd code is pretty careful to avoid them.  Things should be OK.

--Stephen



Re: partition type to autodetect raid

1999-04-08 Thread Stephen C. Tweedie

Hi,

 The only place I would even imagine this would be possible would be in
 the mode pages, but my recollection of the SCSI standard says that all
 of these modes pages are read only. :(

IIRC there are some writable fields in some drives to allow you to set
caching/writeback behaviour, for example, but even then they are
definitely not persistent.  There's nowhere to store a permanent
data-type marker.

--Stephen



Re: partition type to autodetect raid

1999-04-01 Thread Stephen C. Tweedie

Hi,

On Sun, 28 Mar 1999 15:27:26 -0500 (EST), Laszlo Vecsey
[EMAIL PROTECTED] said:

 Isn't there room in the raid header for an additional flag to mark the
 'partition' type? I realize this might require a 'mkraid --upgrade' to be
 run, but at least the 'partitions' could then be detected and then I could
 root automount more cleanly for example..

That's not the point: we don't even _look_ for a raid superblock unless
the partition is marked for autostart.  There are related problems we need to
deal with regularly when building filesystems: what happens if you
reformat a raid disk as a single ext2fs filesystem?  The raid superblock
remains intact, but we do _not_ want to autostart it.

That's why it's best to leave things as they are: if a partition is not
recognisable as having a raid superblock, we don't autostart it.

--Stephen



Re: Filesystem corruption (was: Re: Linux 2.2.4 RAID - success report)

1999-04-01 Thread Stephen C. Tweedie

Hi,

On Mon, 29 Mar 1999 11:28:25 +0100, Richard Jones
[EMAIL PROTECTED] said:

 Not so fast there :-)

 In the stress tests, I've encountered almost silent
 filesystem corruption. The filesystem reports errors
 as attached below, but the file operations continue
 without error, corrupting files in the process. At
 no time did the RAID software report any problem, nor
 did any reconstruction kick in.

 Anyone have any ideas what might be going on? It doesn't
 seem to be exclusively a 2.2.4 thing. I've seen similar
 problems with 2.0.36-19990128.

This is a pretty good indication of a hardware fault.  Looking at the
messages:

 Mar 26 20:52:35 fred kernel: EXT2-fs error (device md(9,0)): ext2_free_blocks: Freeing blocks not in datazone - block = 550046767, count = 1
 Mar 26 20:52:36 fred kernel: EXT2-fs error (device md(9,0)): ext2_free_blocks: Freeing blocks not in datazone - block = 536870912, count = 1
 Mar 27 10:47:59 fred kernel: EXT2-fs error (device md(9,0)): ext2_free_blocks: Freeing blocks not in datazone - block = 538609421, count = 1

these are block numbers (in hex): 20C90C2F, 20000000, 201A870D.
Something is randomly flipping bit 29 in the block addresses (the block
numbers are entirely valid apart from this).  This may be a disk or
controller fault, but I'd replace the cabling first.
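
A trivial check of that observation (the values are the ones from the log
above; the snippet itself is just an illustration added here):

    #include <stdio.h>

    int main(void)
    {
        unsigned long bad[] = { 550046767UL, 536870912UL, 538609421UL };

        for (int i = 0; i < 3; i++)
            printf("%10lu = 0x%08lX -> with bit 29 cleared: 0x%08lX\n",
                   bad[i], bad[i], bad[i] & ~(1UL << 29));
        return 0;
    }

    /* 0x20C90C2F -> 0x00C90C2F, 0x20000000 -> 0x00000000,
       0x201A870D -> 0x001A870D: in each case the reported value is just
       an ordinary-looking block number with bit 29 set on top. */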

--Stephen



Re: RAID1 experiences

1999-02-15 Thread Stephen C. Tweedie

Hi,

On Sat, 13 Feb 1999 18:14:14 -0500, Michael Stone
[EMAIL PROTECTED] said:

 On Wed, Feb 10, 1999 at 09:43:12AM -0600, Chris Price wrote:
 Instead of pointing fingers at Redhat, I would ask if there is
 someone within the Linux-raid community that actively corresponds with
 redhat to let them know of the current status of linux-raid? Ingo et al. seem
 to be doing a superb job in adding functionality and fixing bugs quickly,
 but that does result in a myriad of patches being issued fairly
 regularly - is it Redhat's responsibility to keep track of linux-raid,
 or is it our responsibility to inform them of stable releases?

 Is anyone in the "linux-raid community" being paid to do research work
 for redhat? If so, they should probably keep redhat informed. If not, I
 think it's fair to expect redhat to do their own work.

Umm, Ingo Molnar == [EMAIL PROTECTED]  I think we can assume that there
is somebody working for Red Hat who knows a bit about the current state
of Raid. :)

However, speaking from the point of view of a kernel developer rather
than a Red Hat employee, there are real obstacles to including the new
Raid stuff in Red Hat Linux, the main one being compatibility with
existing installations using older Raid code.  I wouldn't like to be the
one trying to make Red Hat upgrades work with the new Raid drivers but
without breaking old-style Raid volumes...

--Stephen



Re: benefits of journaling for soft RAID ?

1999-02-11 Thread Stephen C. Tweedie

Hi,

On Thu, 11 Feb 1999 09:00:20 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 can someone please explain what journaling precisely does? (Is this
 a sort of mechanism which leaves the filesystem in a consistent
 state, even in case of disk write interruption due to power loss
 or other causes?)

Exactly.  It keeps a record of in-progress filesystem operations so
that entire complex operations, such as renames, always complete
atomically even if you reboot half-way through.  It eliminates the
need for an fsck at reboot.
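
A conceptual sketch of what that record-keeping looks like (the API below is
invented for illustration, not the ext3/jfs interface -- it is just the shape
of a transaction):

    struct journal;
    struct txn;

    /* Invented interface, illustration only. */
    struct txn *txn_begin(struct journal *j);
    void txn_log_block(struct txn *t, unsigned long blocknr, const void *data);
    void txn_commit(struct txn *t);  /* journal copies + commit record hit disk first */

    void rename_dir_entry(struct journal *j,
                          unsigned long src_blk, const void *src_dir,
                          unsigned long dst_blk, const void *dst_dir)
    {
        struct txn *t = txn_begin(j);
        txn_log_block(t, dst_blk, dst_dir);  /* new directory entry added   */
        txn_log_block(t, src_blk, src_dir);  /* old directory entry removed */
        txn_commit(t);  /* after a crash, replay yields both updates or neither */
    }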

 and the advantages / disadvantages ( makes filesystem slower ?),

It _should_ make it faster, for most access patterns.  It will make
"mount -o sync" operation (for things like NFS servers) *MUCH* faster,
especially if you use a separate disk for the journal.

--Stephen



Re: fsck performance on large RAID arrays ?

1999-02-10 Thread Stephen C. Tweedie

Hi,

On Tue, 9 Feb 1999 13:31:14 +0100 (CET), MOLNAR Ingo
[EMAIL PROTECTED] said:

 Stephen Tweedie is working on the journalling extensions. [not sure what
 the current status is, he had a working prototype end of last year.]

I had journaling and buffer commit code, but not any filesystem
personality stuff.  Current status is that 2.2 is out (yay!) and
journaling is once again my top priority, and I've just started doing
real testing of basic filesystem transactions (currently only for the
simplest case --- chmod --- but most of the others are relatively easy
to add once that works properly).

 AFAIK, these extensions will not destroy anything we have with ext2fs,
 they are (as usual) optional. I'd call it ext3fs too because the changes
 themselves are bigger than ext2fs itself, and together with all the other
 upcoming 2.3 features (ACLs, trees, compression, etc.) it will be
 significantly different from 'classic' ext2fs, but it's up to Stephen ... 

The development codebase is certainly being done as a separate ext3fs
but that is simply to allow me to test things without trashing the root
filesystem on all of my test boxes!  The intention is that eventually
these features should be merged into ext2 proper, but only if we can
absolutely guarantee that there will be no reliability penalty for users
not using the new code.  During the transition/testing period I'll
certainly be maintaining a test tree for ext3 as a separate filesystem
so that people don't have to put their existing data at risk.

--Stephen



Re: Sun disklabels (Was: Re: RELEASE: RAID-0,1,4,5 patch...)

1999-02-04 Thread Stephen C. Tweedie

Hi,

On Thu, 28 Jan 1999 18:56:48 -0800, "David S. Miller"
[EMAIL PROTECTED] said:

 You need to start using data at cylinder 1 on all disks or it will get
 nuked.  It doesn't happen on the first disk because ext2 skips some
 space at the beginning of the volume.

 Swap space has the same problem, you cannot start it at cyliner 0.

The new-style SWAPSPACE2 avoids the first 1024 bytes in the partition
precisely because of requests such as these from Sparc people.

And yes, I agree, having the same facility for raid component partitions
would be most useful.  Adding a "start data offset" to the raid
superblock, defaulting to 0, would allow backwards compatibility too.

--Stephen



Re: Is this possible/feasible

1998-10-19 Thread Stephen C. Tweedie

Hi,

On Sun, 18 Oct 1998 23:42:39 +, "Adam Williams"
[EMAIL PROTECTED] said:

 Any pointers on where to gets doc's for this setup?

linux/Documentation/nbd.txt (surprise!) documents network block
devices.  The fact that raid may be running on nbd doesn't affect the
upper raid stuff at all.

Ingo, this is actually a problem: on nbd raid1, we *really* want read
balancing to prefer the local disk if possible!

 Does anyone know if the CODA filesystem has redundancy features?

Yes, it does, along with automatic reconciliation.  Very nice.

--Stephen



Re: Is this possible/feasible

1998-10-19 Thread Stephen C. Tweedie

Hi,

On Sun, 18 Oct 1998 15:55:35 +0200 (CEST), MOLNAR Ingo
[EMAIL PROTECTED] said:

 On Sun, 18 Oct 1998, Tod Detre wrote:

 in 2.1 kernels you can make nfs a block device.  raid can work with block
 devices so if you raid5 several nfs computers one can go down, but you
 still can go on. 

 you probably want to use Stephen Tweedie's NBD (Network Block Device),

Heh, thanks, but the credit is Pavel Machek's.  I've just been testing
and bug-fixing it.

 which works over TCP and is as such more reliable and works over a bigger
 distance and a larger dropped-packets range. You can even have 5 disks on 5
 continents put together into a big RAID5 array. (meant to survive a
 meteorite up to the size of a few 10 miles ;) and you can loopback it
 through a crypt^H^H^H^H^Hcompression module too before sending it out to
 the net. 

Of course, you'll need to manually reconstruct the raid array as
appropriate, and you don't get raid autostart on a networked block
device either.  However, it ought to be fun to watch, and I'm hoping we
can integrate this method of operation into some of the clustering
technology now appearing on Linux to do failover of NFS services if one
of the networked raid hosts dies.  Just remount the raid on another
machine using the surviving networked disks, remount ext2fs and migrate
the NFS server's IP address: voila!

--Stephen



Re: Linear/ext2-volume question

1998-10-19 Thread Stephen C. Tweedie

Hi,

On Sun, 18 Oct 1998 12:05:11 +0100, "Johan Gronvall" [EMAIL PROTECTED]
said:

 I'm new to this list so please bear with me if I ask stupid questions.

 I'm looking for a kind of linear solution. I have however got the
 impression that you can only 'concatenate' 2 disks or partitions to
 make a single md device. Correct?

No, you can have as many as you want.

 And both disks need to be reformatted. Right?

Yes.

 If that's the case, then who's actually using linear mode? 

Me, for a start!  I found it very useful to be able to combine together
a few scraps of spare space on a number of mounted disks to create a
scratch partition of useful size.

 Anyway, I found something that was called ext2-volume, a kind of
 extension to the ext2 filesystem, that made it possible to extend a
 mounted partition on the fly! Cool, but I don't know how to build
 it. It seems that I lack a file called ext2fs.h. Anyone tried this?

Yes, it is due to be integrated into ext2 in the 2.3 kernels, but for
now I wouldn't advise using it as it lacks some fairly important things
like e2fsck. :)

--Stephen