Re: Trimming VFS inodes?

2000-07-03 Thread Richard Gooch

Alexander Viro writes:
> 
> On Wed, 14 Jun 2000, Richard Gooch wrote:

Sigh. It's taken me far to long to get back to this.

> > > Not only that, actually - order of invalidation was incorrect, IIRC.
> > 
> > Let me check I understand what you mean. You're concerned about the
> > way I *invalidate*, rather than the way I *revalidate*?
> 
> Watch for interaction between
>   * your invalidation code
>   * VFS revalidation stuff
>   * your ->revalidate()
> 
> > So, basically, the order in which I unregister devices and invalidate
> > dentries is where you see a problem?
> 
> _one_ of the problems.

I've been looking at this, and while things aren't as elegant as I'd
like, there doesn't appear to be anything harmful that could happen.
I've come up with some situations. Note that I'm assuming the Big
Kernel Lock is not present (since it's on death row).

Case 1:
Process A does unlink(2)
A gets dir->i_sem
A kills dentry
Process B tries lookup()
B waits for dir->i_sem
A drops dentry
A yields dir->i_sem
B grabs dir->i_sem
A frees old (orphaned) dentry
B goes forth in peace

No problem.

Case 2:
A does lookup()
A gets dir->i_sem
A wakes devfsd
A hashes -ve dentry and unlocks dir->i_sem
A blocks in devfs_d_revalidate_wait()
B tries unlink(2)
B is blocked in devfs_d_revalidate_wait()
devfsd does it's thing, makes inode for dentry
devfsd wakes A and B
B grabs dir->i_sem
A blocks on dir->i_sem
B unregisters device entry
B drops dentry
B yields dir->i_sem
A grabs dir->i_sem
A finds dentry still has inode, so it returns from devfs_lookup()
A does whatever it wanted to do, then returns to user-space
A frees (orphaned) dentry

No problem.

Now, I would prefer it if real_lookup() made a call to
result->d_op->d_revalidate() for the case where it calls the lookup()
method. It would allow me to clean up the devfs code a bit. It would
also look more consistent.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



PATCH: Trying to get back IO performance (WIP)

2000-07-03 Thread Juan J. Quintela


Hi
This patch is against test3-pre2.
It gives here good performance in the first run, and very bad
in the following ones of dbench 48.  I am hitting here problems with
the locking scheme.  I get a lot of contention in __wait_on_supper.
Almost all the dbench processes are waiting in:

0xc7427dcc 0xc0116fbd schedule+0x389 (0xc4840c20, 0x12901d, 0xc7427ea0, 0x123456
7, 0xc7426000)
   kernel .text 0xc010 0xc0116c34 0xc01173c0
   0xc013639c __wait_on_super+0x184 (0xc13f4c00)
   kernel .text 0xc010 0xc0136218 0xc0136410
   0xc01523e5 ext2_alloc_block+0x21 (0xc4840c20, 0x12901d, 0xc7427ea0)
   kernel .text 0xc010 0xc01523c4 0xc015245c
0xc7427e5c 0xc0152892 block_getblk+0x15e (0xc4840c20, 0xc316b5e0, 0x9, 0x15, 0xc
7427ea0)
   kernel .text 0xc010 0xc0152734 0xc0152a68
0xc7427eac 0xc0152ed0 ext2_get_block+0x468 (0xc4840c20, 0x15, 0xc2d99de0, 0x1)
   kernel .text 0xc010 0xc0152a68 0xc0152fc0
0xc7427ef4 0xc0133ae3 __block_prepare_write+0xe7 (0xc4840c20, 0xc11f7f78, 0x0, 0
x1000, 0xc0152a68)
   kernel .text 0xc010 0xc01339fc 0xc0133bf0
0xc7427f18 0xc0134121 block_prepare_write+0x21 (0xc11f7f78, 0x0, 0x1000, 0xc0152
a68)
   kernel .text 0xc010 0xc0134100 0xc013413c
[0]more> 
0xc7427f30 0xc01531d1 ext2_prepare_write+0x19 (0xc06b4f00, 0xc11f7f78, 0x0, 0x10
00)
   kernel .text 0xc010 0xc01531b8 0xc01531d8
0xc7427f90 0xc0127b8d generic_file_write+0x305 (0xc06b4f00, 0x8050461, 0xafc2, 0
xc06b4f20)
   kernel .text 0xc010 0xc0127888 0xc0127cf0
0xc7427fbc 0xc0130ea8 sys_write+0xe8 (0x9, 0x804b460, 0xffc3, 0x28, 0x1082)
   kernel .text 0xc010 0xc0130dc0 0xc0130ed0
   0xc0109874 system_call+0x34
   kernel .text 0xc010 0xc0109840 0xc0109878


This behavior also happens with vanilla kernel, only that it is not
so easy to reproduce.  Vanilla Kernel also gives normally worse IO
throughput than the first run with this patch.

Comments/suggerences/fixes are welcome.

Later, Juan.

This patch does:
 - Introduces WRITETRY logic that means: write this buffer if that
   is possible, but never block.  If we have to block, skip the
   write. (mainly from Jens axboe)
 - Uses that logic in sync_page_buffers(), i.e. never try to block
   in writes, when writing buffers.
 - Do all the blocking/waiting calling balance_dirty() in
   shrink_mmap.
 - export the sync_page_buffers function.
 - make try_to_free_buffers never generate any IO.
 - Change the caller of that two functions accordingly with the new
   semantics.


diff -urN --exclude-from=/home/lfcia/quintela/work/kernel/exclude 
base/drivers/block/ll_rw_blk.c working/drivers/block/ll_rw_blk.c
--- base/drivers/block/ll_rw_blk.c  Fri Jun 30 18:42:30 2000
+++ working/drivers/block/ll_rw_blk.c   Mon Jul  3 00:03:00 2000
@@ -556,7 +556,7 @@
unsigned int sector, count;
int max_segments = MAX_SEGMENTS;
struct request * req = NULL;
-   int rw_ahead, max_sectors, el_ret;
+   int r_ahead, w_ahead, max_sectors, el_ret;
struct list_head *head = &q->queue_head;
int latency;
elevator_t *elevator = &q->elevator;
@@ -584,30 +584,24 @@
}
}
 
-   rw_ahead = 0;   /* normal case; gets changed below for READA */
+   r_ahead = w_ahead = 0;
switch (rw) {
case READA:
-   rw_ahead = 1;
+   r_ahead = 1;
rw = READ;  /* drop into READ */
case READ:
if (buffer_uptodate(bh)) /* Hmmph! Already have it */
goto end_io;
kstat.pgpgin++;
break;
-   case WRITERAW:
-   rw = WRITE;
-   goto do_write;  /* Skip the buffer refile */
+   case WRITETRY:
+   w_ahead = 1;
case WRITE:
if (!test_and_clear_bit(BH_Dirty, &bh->b_state))
goto end_io;/* Hmmph! Nothing to write */
refile_buffer(bh);
-   do_write:
-   /*
-* We don't allow the write-requests to fill up the
-* queue completely:  we want some room for reads,
-* as they take precedence. The last third of the
-* requests are only for reads.
-*/
+   case WRITERAW:  /* Skip the buffer refile */
+   rw = WRITE;
kstat.pgpgout++;
break;
   

Re: PATCH: Trying to get back IO performance (WIP)

2000-07-03 Thread Stephen C. Tweedie

Hi,

On Mon, Jul 03, 2000 at 02:24:07AM +0200, Juan J. Quintela wrote:

> This patch is against test3-pre2.
> It gives here good performance in the first run, and very bad
> in the following ones of dbench 48.  I am hitting here problems with
> the locking scheme.  I get a lot of contention in __wait_on_supper.
> Almost all the dbench processes are waiting in:
> 
>0xc013639c __wait_on_super+0x184 (0xc13f4c00)
>0xc01523e5 ext2_alloc_block+0x21 (0xc4840c20, 0x12901d, 0xc7427ea0)

Known, and I did a patch for this ages ago.  It actually didn't make a
whole lot of difference.  The last version of the ext2 diffs I did for
this are included below.

--Stephen



>From [EMAIL PROTECTED]  Mon May  1 18:18:15 2000
Return-Path: 
Received: (from sct@localhost)
by dukat.scot.redhat.com (8.9.3/8.9.3) id SAA16268
for [EMAIL PROTECTED]; Mon, 1 May 2000 18:18:14 +0100
Resent-From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Resent-Message-ID: <[EMAIL PROTECTED]>
Resent-Date: Mon, 1 May 2000 18:18:14 +0100 (BST)
Resent-To: [EMAIL PROTECTED]
X-Authentication-Warning: worf.scot.redhat.com: sct set sender to 
[EMAIL PROTECTED] using -f
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]>
<[EMAIL PROTECTED]>
X-Mailer: VM 6.59 under Emacs 20.3.1
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Linus Torvalds <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>, "Stephen C. Tweedie" <[EMAIL PROTECTED]>,
[EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: DANGER: DONT apply it. Re: [patch] ext2nolock-2.2.8-A0
Date: Sat, 15 May 1999 03:05:41 +0100 (BST)
Status: RO
Content-Length: 22531
Lines: 755

Hi,

Linus Torvalds writes:

 > In particular, ext2_new_block() returns a block number. That's all fine
 > and dandy, but it's extremely and utterly idiotic. It means that
 > ext2_new_block() needs to clear the block, which is where all the
 > race-avoidance comes from as far as I can tell. 

There are two sets of races involved.  One involves consistency of the
bitmaps themselves: we need to make sure that concurrent allocations
are consistent.  The second is consistency of inodes, and that is
already handled in fs/ext2/inode.c: when we call ext2_alloc_block(),
we follow it up with a check to see if anybody else has allocated the
same block in the same inode while we slept.  If so, we free the block
and repeat the block search.

So, there's no reason why block clearing needs to be locked: the block
cannot be installed in the inode indirection maps until we have
completely returned from ext2_new_block() anyway.

Anyway, the patch below (against 2.2.8) is an extension of the last
one I posted, and it gets rid of the superblock lock in balloc.c
entirely.  We do have to be a bit more careful about allocation
consitency without the lock: in particular, quota operations used to
happen between bitmap and group descriptor updates, but now we can't
risk blocking while the bitmaps and group descriptors are
inconsistent.

It compiles but I can't test right now.  Feel free to play with it if
you want to, but I do believe we can safely go without superblock
locks in both ialloc.c and balloc.c if we are careful.

The main change that dropping the lock imposes on us is that we cannot
rely on the bitmap buffer remaining pinned in cache for the duration
of the allocation, so we have to bump the buffer_head b_count in
load_block_bitmap() and brelse it once we are finished with it.
Shift-scrlck will be very useful here for spotting the buffer_head
leaks I've introduced. :)

--Stephen

alloc-2.2.8.diff:

--- fs/ext2/balloc.c.~1~Thu Oct 29 05:54:56 1998
+++ fs/ext2/balloc.cFri May 14 14:46:59 1999
@@ -9,6 +9,13 @@
  *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
  *  Big-endian to little-endian byte-swapping/bitmaps by
  *David S. Miller ([EMAIL PROTECTED]), 1995
+ *
+ *  Dropped use of the superblock lock.
+ *  This means that we have to take extra care not to block between any
+ *  operations on the bitmap and the corresponding update to the group 
+ *  descriptors.
+ *   Stephen C. Tweedie ([EMAIL PROTECTED]), 1999
+ * 
  */
 
 /*
@@ -74,42 +81,33 @@
 }
 
 /*
- * Read the bitmap for a given block_group, reading into the specified 
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group.
  *
- * Return >=0 on success or a -ve error code.
+ * Return the buffer_head on success or NULL on IO error.
  */
 
-static int read_block_bitmap (struct super_block * sb,
-  unsigned int block_group,
-  unsigned long bitmap_nr)
+static struct buffer_head * read_block_bitmap (struct super_block * sb,
+ 

Too many contention in __wait_on_super

2000-07-03 Thread Juan J. Quintela


Hi 

   I have been changing the write of buffers in the IO layer, and I
have found that the system gets a lot of contention in
__wait_on_super().  I am using test3-pre1 + kdb patch. I see also the
stalls/vmstat strange output without the kdb patch.  I use it to be
able to see the back-traces.  You can also look at the vmstat output.
I see that vmstat 1 output when running dbench 48, the results doesn't
make sense that we are almost not doing IO in an IO test :

vmstat output:
 r  b  w   swpd   free   buff  cache  si  sobibo   incs  us  sy  id
 0 48  2   5004   3112   1948 112652   0   0 0   151  221   238   2   1  96
 0 48  2   5004   3108   1948 112652   0   0 0   133  195   188   0   0  99
 0 48  2   5004   3108   1948 112652   0   0 0   141  192   184   0   0  99

when things are working properly, that field is between 1000-2000.

Now the traces:

Almost all the dbench processes are stuck with the first back-trace.

Someone find some sense to that output and/or know how to fix that
problem.

Any comment are welcome.  If you need more information, let me know.

Later, Juan.


Most common back-trace:

0xc721bdcc 0xc0116fbd schedule+0x389 (0xc6e3bb60, 0x98e01, 0xc721bea0, 0x1234567
, 0xc721a000)
   kernel .text 0xc010 0xc0116c34 0xc01173c0
   0xc01363ac __wait_on_super+0x184 (0xc13f4c00)
   kernel .text 0xc010 0xc0136228 0xc0136420
   0xc01523f5 ext2_alloc_block+0x21 (0xc6e3bb60, 0x98e01, 0xc721bea0)
   kernel .text 0xc010 0xc01523d4 0xc015246c
0xc721be5c 0xc01528a2 block_getblk+0x15e (0xc6e3bb60, 0xc2f75c20, 0x8d, 0x99, 0x
c721bea0)
   kernel .text 0xc010 0xc0152744 0xc0152a78
0xc721beac 0xc0152ee0 ext2_get_block+0x468 (0xc6e3bb60, 0x99, 0xc221e440, 0x1)
   kernel .text 0xc010 0xc0152a78 0xc0152fd0
0xc721bef4 0xc0133af3 __block_prepare_write+0xe7 (0xc6e3bb60, 0xc11c1ee8, 0x0, 0
x1000, 0xc0152a78)
   kernel .text 0xc010 0xc0133a0c 0xc0133c00
0xc721bf18 0xc0134131 block_prepare_write+0x21 (0xc11c1ee8, 0x0, 0x1000, 0xc0152
a78)
   kernel .text 0xc010 0xc0134110 0xc013414c
0xc721bf30 0xc01531e1 ext2_prepare_write+0x19 (0xc719b5e0, 0xc11c1ee8, 0x0, 0x10
00)
   kernel .text 0xc010 0xc01531c8 0xc01531e8
0xc721bf90 0xc0127b8d generic_file_write+0x305 (0xc719b5e0, 0x8054469, 0x6fba, 0
xc719b600)
   kernel .text 0xc010 0xc0127888 0xc0127cf0
0xc721bfbc 0xc0130ea8 sys_write+0xe8 (0x9, 0x804b460, 0xffc3, 0x28, 0x1092)
   kernel .text 0xc010 0xc0130dc0 0xc0130ed0
   0xc0109874 system_call+0x34
   kernel .text 0xc010 0xc0109840 0xc0109878

or in:
0xc320de58 0xc0116fbd schedule+0x389 (0xc4c130c0, 0xc13f4c00, 0xc4c138e0, 0x1234
567, 0xc320c000)
   kernel .text 0xc010 0xc0116c34 0xc01173c0
   0xc01363ac __wait_on_super+0x184 (0xc13f4c00)
   kernel .text 0xc010 0xc0136228 0xc0136420
   0xc0151ce9 ext2_new_inode+0x6d (0xc4c130c0, 0x8180, 0xc320df00)
   kernel .text 0xc010 0xc0151c7c 0xc01521dc
0xc320df04 0xc0154555 ext2_create+0x1d (0xc4c130c0, 0xc5c17b60, 0x8180)
   kernel .text 0xc010 0xc0154538 0xc01545f4
0xc320df28 0xc013df64 vfs_create+0xdc (0xc4c130c0, 0xc5c17b60, 0x180)
   kernel .text 0xc010 0xc013de88 0xc013dfcc
0xc320df58 0xc013e198 open_namei+0x1cc (0xc4c12000, 0x243, 0x180, 0xc320df7c)
   kernel .text 0xc010 0xc013dfcc 0xc013e6cc
0xc320df98 0xc0130376 filp_open+0x3a (0xc4c12000, 0x242, 0x180)
   kernel .text 0xc010 0xc013033c 0xc0130398
0xc320dfbc 0xc01306dc sys_open+0x68 (0xbc4c, 0x242, 0x180, 0x242, 0xbc4c
)
   kernel .text 0xc010 0xc0130674 0xc01307dc
   0xc0109874 system_call+0x34
   kernel .text 0xc010 0xc0109840 0xc0109878


or in:
0xc5a19e50 0xc0116fbd schedule+0x389 (0xc5b58a00, 0x1000, 0xfff4, 0xc5a18000
, 0x1234567)
   kernel .text 0xc010 0xc0116c34 0xc01173c0
   0xc0131db8 __wait_on_buffer+0x1d8 (0xc5b58a00)
   kernel .text 0xc010 0xc0131be0 0xc0131e34
   0xc0133773 unmap_buffer+0x33 (0xc5b58a00)
   kernel .text 0xc010 0xc0133740 0xc01337a0
0xc5a19eb8 0xc01338e1 unmap_underlying_metadata+0x29 (0xc536fc00)
[1]more> 
   kernel .text 0xc010 0xc01338b8 0xc01338f0
0xc5a19ef4 0xc0133b0f __block_prepare_write+0x103 (0xc62c25e0, 0xc110e680, 0x0, 
0x1000, 0xc0152a78)
   

Re: [Charon-dev] Re: VFS not completely factored, and more

2000-07-03 Thread Alexander Viro



[reformatted]
On Mon, 26 Jun 2000, Michael W Zappe wrote:

[snip]
> filesystem, CXFS.  (originally named Charon, but we discovered two
> companies warring over the trademark, and didn't want to touch that with
> a 40 foot pole... ;-)

Heh. XFS folks mentioned clustered variant of their puppy. Yup, also
CXFS...

> But in terms of the VFS problems, I have to say that your idea is a good
> one, since it keeps the data local to the inode!  The only issue i'd
>have with it is the way you do the define.  I think a much cleaner
>solution would be to have each fs define a structure
> 
> struct ext2_inode
> {
>   struct inode ei_inode;
>   struct ext2_inode_info ei_ext2_info;
> };
> 
> And just clean up all of the other filesystems to use a more generic
>mechanism to cast it, such as:
> 
> #define EXT2_INODE_INFO(inode) (((struct ext2_inode *)(inode))->ei_ext2_info)

Umm... I don't like how it sounds. Basically, it means that iget() goes
to hell - that way allocation has to be done in fs. May be a good thing,
but we are not there yet.

> This would also mean that if anything changed, we have an inexpensive
> (non-existent at runtime) level of abstraction which would allow for a
> simple change.  I also use a similar mechanism, and it works great for
> portability.  I'd definately be willing to help clean up the FS code to
> use the new interface.

Could you describe that new interface in some details? I don't see how you
deal with knfsd and with allocation/freeing policy. Another thing to watch
for: init_once()-type animals.

> I also wouldn't mind adding changing the iget code to a new iget2 which
> takes an opaque, fs dependent parameter (void *).  We would also split

Not funny. First of all, you have to deal with callers in knfsd. If we
kill them (fh<->dentry patches) there is no reason to keep iget() and
->read_inode() at all. Otherwise... Where the heck will knfsd get your
void * cookie?

> the functions into 
> 
> void read_inode(struct inode *, void *);
> void get_inode(struct inode *, void *);
> 
> Read inode is called only on an uninitialized inode structure, and
> get_inode whenever iget pulls an inode out of the cache.  This way, if
> the FS isn't concerned with additional reads after the first, the
> pointer can just be NULL and we optimize out the call.

??? That's something new - which filesystems need to do reread upon the
icache hit?

> Also, i wouldn't mind changing i_ino to be a __u64.  Anyone see any
> major problems with this?  We'd have to change the hashing code, (which
> wouldn't be a bad idea anyway) but i think most everything else will
> convert fairly cleanly.  I have to look at this change a little more
> closely, but it seems to be a good idea.

No problem, except that you are getting an overhead on comparisons in the
cache. My gut feeling is that it's not going to be critical, but there's
only one way to figure out... Keep in mind that x86 is register-starved,
so anything 64-bit is a strain on compiler. I suspect that it may turn
into worse overhead than anything coming from separate allocation of
per-fs part.