ext3 for 2.4

2001-05-17 Thread Andrew Morton

Summary: ext3 works, page_launder() doesn't :)



The tree is based on the porting work which Peter Braam did.  It's
in cvs in Jeff Garzik's home on sourceforge.  Info on CVS is at
http://sourceforge.net/cvs/?group_id=3242 - the module name
is `ext3'.  There's a README there which describes how to
apply the patchset.

Current status is: quite solid.  Stress testing on x86/SMP
passes and performance in ordered data and writeback data
mode is good.  Journalled data performance is, of course, so-so.
The only big issue of which I am aware is a VM livelock
on SMP, discussed below.

The patch is against 2.4.4-ac9.

Today's changes:

- quotas appear to work OK.  I'll leave them turned on
  as I test things, and watch out for oddities.

  It's hard to find working quota tools.  Most of them
  either won't compile or don't understand ext3 (or both).
  Jan Kara is maintaining a set of quota tools at
  http://www.sourceforge.net/projects/linuxquota/ which
  work well.  The current CVS tree from there seems to be
  under XFS development at present and needs a couple of
  patches to work against ext3 (and even ext2).  I can send
  them to whoever needs them.

- Recovery works fine now.  The bug was that I was splicing new
  blocks into a file in ext3_splice_branch() *before* doing a
  journal_get_write_access() on its parent's buffer.  Duh.

- Four debugging fields (b_alloc_transaction, etc.) have been
  removed from buffer_head.  I couldn't find a use for them
  in 2.4.  In 2.2, these were set in ext3_new_block() when we
  do a getblk() on the new block.  In 2.4, we don't do the
  getblk() any more...

- Some tightening of the way commit feeds buffers into the 
  request queues.  At present, 256 buffers are fed into
  ll_rw_block() before we run tq_disk.  I *was* pushing
  thousands down.  It doesn't seem to make much difference.
  Overall throughput with some benchmarks in ordered data
  mode has been significantly improved by this change.
  ext3 in general seems faster in 2.4 than in 2.2, presumably
  because of better request merging.

  Much more work needs to go into benchmarking and performance
  tuning.
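
  The batching described above can be pictured roughly as follows.
  This is only a userspace sketch in C, not the JBD commit code:
  `sketch_queue', sketch_submit() and sketch_unplug() are made-up
  stand-ins for the request queue, ll_rw_block() and running tq_disk.
  Only the batch size of 256 comes from the message.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BATCH 256          /* batch size quoted in the message */

/* Hypothetical model of a request queue; not a kernel structure. */
struct sketch_queue {
    unsigned long submitted;      /* buffers handed to the queue */
    unsigned long unplugs;        /* times the queue was kicked */
};

/* Stand-in for ll_rw_block(): queue n buffers for writing. */
void sketch_submit(struct sketch_queue *q, size_t n)
{
    q->submitted += n;
}

/* Stand-in for running tq_disk: start I/O on what has been queued. */
void sketch_unplug(struct sketch_queue *q)
{
    q->unplugs++;
}

/* Feed nr_buffers to the queue, kicking it after every batch. */
void sketch_commit_write(struct sketch_queue *q, size_t nr_buffers)
{
    assert(q != NULL);
    while (nr_buffers > 0) {
        size_t n = nr_buffers < SKETCH_BATCH ? nr_buffers : SKETCH_BATCH;

        sketch_submit(q, n);
        sketch_unplug(q);
        nr_buffers -= n;
    }
}
```

  With this shape the queue never holds more than one batch of
  commit buffers before I/O is started on it.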

- There's an issue with page_launder():

    ext3_file_write()
      -> generic_file_write()
        -> __alloc_pages()
          -> page_launder()
            -> ext3_writepage()

  This is bad.  It will cause ext3 to be reentered while it
  has a transaction open against a different fs.  This will
  corrupt filesystems and can deadlock.

  Making ext3_file_write() set PF_MEMALLOC wasn't suitable.  It
  easily causes 0-order allocation failures within generic_file_write().

  The current approach to this is, in ext3_writepage(), to detect
  when ext3 is being reentered and to simply *return* without
  writing the page at all.

  This is kludgy but should work - the only place where the fs can
  be reentered via writepage() is from page_launder(), and
  page_launder() doesn't wait on the page. Quotas don't use
  writepage(), and reentry there is OK.

  If Marcelo's `priority' argument to writepage() goes in,
  this can be used in a more sensible manner.

  Note that this return-if-reentered code is not related to
  the VM livelock.  It has a big printk in it at present.
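
  A rough userspace model of the return-if-reentered check, in C.
  The message does not say how reentry is detected, so this sketch
  assumes a per-task pointer to the currently open transaction
  handle.  All of the `sketch_' names are hypothetical and none of
  this is the real ext3_writepage().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures. */
struct sketch_handle { int fs_id; };               /* an open transaction */
struct sketch_task   { struct sketch_handle *open_handle; };
struct sketch_page   { int dirty; };

enum { SKETCH_WROTE = 0, SKETCH_SKIPPED = 1 };

/*
 * Model of writepage(): if the calling task already has a
 * transaction open, we were reentered (e.g. via the allocator),
 * so return without touching the page.
 */
int sketch_writepage(struct sketch_task *task, struct sketch_page *page)
{
    assert(task != NULL && page != NULL);

    if (task->open_handle != NULL)
        return SKETCH_SKIPPED;    /* page stays dirty for a later pass */

    page->dirty = 0;              /* pretend the page was written out */
    return SKETCH_WROTE;
}
```

  The skipped page is left dirty, so in this model nothing is lost.
  It is just written by a later caller which is not inside a
  transaction.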

- Some new test tools:

  To simulate crashes I have added a new mount option:

mount /dev/foo /mnt/bar -t ext3 -o ro-after=NNN

  When the fs is mounted this way a timer will fire after
  NNN jiffies and will turn the underlying device immutable.
  It does this by setting a flag which is tested in submit_bh().
  For WRITE requests submit_bh() will simply call
  bh_end_io(uptodate=1) and return.
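
  The behaviour described above can be modelled in userspace like
  this.  It is a sketch under assumptions, not the patch itself:
  the structures and the sketch_timer_tick() and sketch_submit_bh()
  functions are invented for illustration, and the real code works
  on struct buffer_head inside the kernel.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_rw { SKETCH_READ, SKETCH_WRITE };

/* Hypothetical model of a block device with a crash deadline. */
struct sketch_dev {
    unsigned long ro_after;       /* deadline in ticks (ro-after=NNN) */
    int           readonly;       /* set once the timer has fired */
    unsigned long writes_done;    /* writes which really hit the "disk" */
};

struct sketch_bh {
    struct sketch_dev *dev;
    int                uptodate;  /* completion status seen by the caller */
};

/* Timer model: flip the device to read-only once the deadline passes. */
void sketch_timer_tick(struct sketch_dev *dev, unsigned long now)
{
    assert(dev != NULL);
    if (now >= dev->ro_after)
        dev->readonly = 1;
}

/* Submit model: after the deadline, writes "succeed" but go nowhere. */
void sketch_submit_bh(enum sketch_rw rw, struct sketch_bh *bh)
{
    assert(bh != NULL && bh->dev != NULL);

    if (rw == SKETCH_WRITE && bh->dev->readonly) {
        bh->uptodate = 1;         /* report success, write nothing */
        return;
    }
    if (rw == SKETCH_WRITE)
        bh->dev->writes_done++;
    bh->uptodate = 1;             /* I/O completes immediately here */
}
```

  Because the dropped writes still complete with uptodate=1, the
  filesystem cannot tell that the disk has "crashed", which is what
  makes this a reasonable simulation of a power failure.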

  There's a new ext3 ioctl() which will block the caller until the
  device has gone readonly.  I semi-randomly chose

#define EXT3_WAIT_FOR_READONLY _IOR('w', 1, long)
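
  A controlling program might block on that ioctl roughly like this.
  It is a sketch with assumptions: the message does not say which
  file descriptor the ioctl expects or what the argument means, so
  this opens the mount point directory and passes a dummy long.
  wait_for_readonly() is a made-up helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* ioctl number as quoted in the message above. */
#define EXT3_WAIT_FOR_READONLY _IOR('w', 1, long)

/*
 * Block until the device under `mountpoint' has gone read-only.
 * Returns 0 on success, -1 if the open or the ioctl fails.
 */
int wait_for_readonly(const char *mountpoint)
{
    long unused = 0;   /* assumption: the argument is not documented */
    int fd, ret;

    assert(mountpoint != NULL);

    fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -1;

    ret = ioctl(fd, EXT3_WAIT_FOR_READONLY, &unused);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

  On a kernel without the patch the ioctl is expected to fail, so
  callers should treat -1 as "not supported" rather than as a crash
  having happened.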


  The intent here is that a controlling script will:

    1: Mount the fs with ro-after=1000 (ten seconds at HZ=100)
    2: Start a test script (e.g. dbench)
    3: Block on the wait-for-readonly ioctl
    4: Wake up when the disk has crashed
    5: Kill off the test script
    6: Unmount the fs
    7: Mount the fs (let recovery run)
    8: Unmount the fs
    9: Run e2fsck to check that the fs is sane
    10: Modify the ro-after parameter
    11: Do it all again

  Scripts which do all this are in the testing/ and tools/
  directories.  I've been happily simulating crashes in
  the middle of `dbench 12' runs for an hour now.  All is well.

  I think this covers everything except for verifying that the
  data content of the files is sane.  That can be handled with
  test tools.  Special code will probably be needed to simulate
  crashes during truncate - with this shotgun approach the fs
  tends to go immutable before *any* of the truncate has committed,
  and it's as if nothing ever happened.

  The `ro-after' code and submit_bh() changes are conditional
  on CONFIG_JBD_DEBUG.



So.  

Re: ext3 for 2.4

2001-05-17 Thread Andrew Morton

Andrew Morton wrote:
 
> The tree is based on the porting work which Peter Braam did.  It's
> in cvs in Jeff Garzik's home on sourceforge.  Info on CVS is at
> http://sourceforge.net/cvs/?group_id=3242 - the module name
> is `ext3'.

That was a bit cryptic.

    cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/gkernel login
    (hit enter for passwd)
    cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/gkernel co ext3

Also, there's a silly bug which crashes things if the patch is applied,
but you select CONFIG_EXT3_FS=n.  Don't do that - just back everything
out.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]



Re: ext3 for 2.4

2001-05-17 Thread Andrew Morton

Daniel Phillips wrote:
 
> And the third is a combination of two patches:
>
>   ftp://ftp.math.psu.edu/pub/viro/ext2-dir-patch-S4.gz
>   http://nl.linux.org/~phillips/htree/dx.pcache-2.4.4-6
 

These changes have a very low impact on the journalling code,
and vice versa.  A few days' effort to merge them once ext3/2.4
is steady.  And once the pcache stuff is steady: the journalling
code is pretty complex.

It's probably worth thinking about adding a fourth journalling
mode: `journal=none'.  Roll it all up into a single codebase
and call it ext4.

It rather depends on where the buffercache ends up.  ext3 is
a client of JBD (nee JFS).  JBD does *block* level journalling.
Any major change at that level will take rather some adjusting
to.



Re: ext3 for 2.4

2001-05-17 Thread Andreas Dilger

Andrew writes:
> It's probably worth thinking about adding a fourth journalling
> mode: `journal=none'.

Yes, I had added this (at least in skeleton form) in my ext3 tree.
If only I could keep up with you and Daniel for both the ext3 and
indexed directory stuff, I might be able to submit it...

Basically, I wrapped all of the ext3 journal operations in an inline
ext3_journal_op and did nothing if handle was NULL (except
for journal_dirty_metadata(), which went to mark_buffer_dirty()).
You could also just make them (mostly) no-ops if CONFIG_EXT3_FS
was not defined and we were living in the ext2 tree.
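
A minimal userspace sketch of that wrapper idea, in C.  It is not the
code from Andreas's tree: the types are toy models, and the `sketch_'
functions only stand in for journal_get_write_access(),
journal_dirty_metadata() and mark_buffer_dirty().  The point is just
that a NULL handle means "no journal".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy models of a transaction handle and a buffer. */
struct sketch_handle { int journalled_buffers; };
struct sketch_buffer { int dirty; int in_journal; };

/* Stand-in for journal_get_write_access(). */
int sketch_journal_get_write_access(struct sketch_handle *h,
                                    struct sketch_buffer *bh)
{
    (void)h;
    (void)bh;
    return 0;
}

/* Stand-in for journal_dirty_metadata(). */
int sketch_journal_dirty_metadata(struct sketch_handle *h,
                                  struct sketch_buffer *bh)
{
    h->journalled_buffers++;
    bh->in_journal = 1;
    return 0;
}

/* Stand-in for mark_buffer_dirty(). */
void sketch_mark_buffer_dirty(struct sketch_buffer *bh)
{
    bh->dirty = 1;
}

/* Wrapper: a no-op when there is no journal (handle == NULL). */
int sketch_ext3_get_write_access(struct sketch_handle *h,
                                 struct sketch_buffer *bh)
{
    assert(bh != NULL);
    if (h == NULL)
        return 0;
    return sketch_journal_get_write_access(h, bh);
}

/* Wrapper: without a journal, fall back to plain buffer dirtying. */
int sketch_ext3_dirty_metadata(struct sketch_handle *h,
                               struct sketch_buffer *bh)
{
    assert(bh != NULL);
    if (h == NULL) {
        sketch_mark_buffer_dirty(bh);
        return 0;
    }
    return sketch_journal_dirty_metadata(h, bh);
}
```

Callers then look the same whether or not the filesystem has a
journal.  Only the handle they pass differs.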

We may want to keep the orphan list handling even for ext2, because
e2fsck does orphan cleanup regardless of whether the filesystem has
a journal, and it would reduce the amount of output from e2fsck. Doesn't
matter much either way.

> Roll it all up into a single codebase and call it ext4.

Rather just stick with ext3 for now.  Don't want to confuse
the issue even more.  If we can get the ext3 code mounting ext2
filesystems (i.e. without a journal), then we can slowly merge
the changes back to stock ext2 surrounded by CONFIG_EXT3_FS
(or not, as Linus dictates).

> It rather depends on where the buffercache ends up.  ext3 is a
> client of JBD (nee JFS).  JBD does *block* level journalling.  Any
> major change at that level will take rather some adjusting to.

Well, Daniel's part of the code still uses buffer_heads, but they
are backed by the page cache and not the buffer cache.  This is
the direction Linus wants to go (AFAICS), that we use the page
cache for caching, and buffer_heads for I/O handles only.  Al's
page-cache directory stuff does not use Daniel's buffer_head
abstraction, so it may need to get changed a bit to work with JBD.

At this point, it is probably worth doing a global search-and-replace
for all of the jfs_* functions, and rename them jbd_*, to avoid
conflicts with IBM JFS.

Cheers, Andreas
-- 
Andreas Dilger  \ If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert



Re: ext3 for 2.4

2001-05-17 Thread Daniel Phillips

On Thursday 17 May 2001 17:53, Andrew Morton wrote:
> It's probably worth thinking about adding a fourth journalling
> mode: `journal=none'.  Roll it all up into a single codebase
> and call it ext4.

Or ext5 (= ext2 + ext3).

> It rather depends on where the buffercache ends up.  ext3 is
> a client of JBD (nee JFS).  JBD does *block* level journalling.
> Any major change at that level will take rather some adjusting
> to.

Well, if you look how I did the index, it works with blocks and buffers 
while still staying entirely in the page cache.  This was Stephen's 
suggestion, and it integrates reliably with Al's page-oriented code.  
So I'm mixing pages and blocks together and it's working pretty well.  
BTW, the parts of Al's patch that I converted from pages to blocks got 
shorter and easier to read.

I'm now working on some code to handle non-data blocks in a similar 
way, so if this works out it could make the conversion an awful lot 
less painful for you.

--
Daniel



Re: ext3 for 2.4

2001-05-17 Thread Jeff Garzik

AFAIK the original stated intention of ext3 was

cd linux/fs
cp -a ext2 ext3
# hack on ext3

That leaves ext2 in ultra-stability,
no-patches-unless-absolutely-necessary mode.

IMHO prove a new feature, like directories in page cache, journaling,
etc. in ext3 first.  Then maybe after a year of testing, if people
actually care, backport those features to ext2.

-- 
Jeff Garzik  | Game called on account of naked chick
Building 1024|
MandrakeSoft |



Re: [Ext2-devel] Re: ext3 for 2.4

2001-05-17 Thread Theodore Tso

On Thu, May 17, 2001 at 03:00:28PM -0400, Jeff Garzik wrote:
> AFAIK the original stated intention of ext3 was
>
>   cd linux/fs
>   cp -a ext2 ext3
>   # hack on ext3
>
> That leaves ext2 in ultra-stability,
> no-patches-unless-absolutely-necessary mode.
>
> IMHO prove a new feature, like directories in page cache, journaling,
> etc. in ext3 first.  Then maybe after a year of testing, if people
> actually care, backport those features to ext2.

Alternatively, once we get ext3 with just journaling stable (and with
an option to not do journaling at all), simply do something like this:

    cd linux/fs
    rm -rf ext2
    mv ext3 ext2
    cp -r ext2 ext3
    # hack hack hack on ext3 and add even more features

So ext3 is always the development version, and ext2 is the stable
version.

- Ted








quota tools (was Re: ext3 for 2.4)

2001-05-17 Thread Nathan Scott

hi,

On May 17,  9:20pm, Andrew Morton wrote:
> Subject: ext3 for 2.4
> ...
> - quotas appear to work OK.  I'll leave them turned on
>   as I test things, and watch out for oddities.
>
>   It's hard to find working quota tools.  Most of them
>   either don't want to compile and/or don't understand
>   ext3.  Jan Kara is maintaining a set of quota tools
>   at http://www.sourceforge.net/projects/linuxquota/ which
>   work well.

Yes, that's your best bet for working quota tools - they
are being maintained by both Jan and Marco.

> The current CVS tree from there seems to be
>   under XFS development at present and needs a couple of

That's not quite correct - the XFS development here was
done a while ago now, and has been bug fixes only for some
time.

>   patches to work against ext3 (and even ext2).  I can send them
>   to whoever needs.

Send them to Jan and/or Marco (both cc'd) and they'll very
quickly show up in cvs.

cheers.

-- 
Nathan