Re: updates to procfs?

1999-10-14 Thread Alexander Viro



On Wed, 13 Oct 1999, Jeff Garzik wrote:

> Someone on linux-kernel mentioned that procfs needed cleanup.  Is there a
> TODO list somewhere?

There is even an initial variant of a patch (ouch... porting the thing from
2.3.13-pre1 to 2.3.22-pre2 _did_ hurt; damn CVS...).

I can't promise that in its current state it _builds_, let alone works -
there were changes, and testing is in order.

Sore points (partially addressed in the patch):
a) way too many places know about the layout of proc_dir_entry.
In most cases they don't need to - dynamic creation works just fine.
I've added several functions (create_proc_*entry/remove_proc_*entry) and I
think that they should cover almost everything. See the patch for usage
examples, and the sketch after this list.
b) many moons ago, when procfs was small, it used an array of
in-core structures that imitated on-disk inodes. Some inumbers were used
by per-process stuff, some by the (then very few) special files (a la
/proc/loadavg, etc.). Methods used to be switches. Well, it didn't scale,
so we went for dynamic inumbers. But the old code stayed around, and
some of the new procfs files went in there too. Some even went into the
switches. The resulting mess was _not_ nice. I tried to clean that up.
c) after the switches period the interface went through several
changes. We'd better unify this stuff. I've barely touched it.
d) many drivers failed to unregister their entries. Each of those
cases is an oopsable bug. I've fixed several such animals, but I suspect
that some are still lurking.
e) we don't need constant inumbers in procfs. If we finally
get rid of them, we will be able to simplify the permission tests in the
per-process part and close several nasty holes with stale dentries.
f) proc_unregister() needs some form of revoke(). It even
implements something, but I suspect that it's racy.
g) proc/mem.c is a living horror. Look at it and you'll see.
h) the per-process part really ought to be separated (code-wise) from
the rest. Different needs, different races, etc...
i) tons of additional fun.
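
To make (a) and (d) concrete, here is the usage pattern I mean. This is a
minimal sketch, not code from the patch: "foodev" is a made-up driver, and
I'm assuming the usual read_proc-style callback for the output.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>

/* "foodev" is a hypothetical example driver */
static int foodev_read_proc(char *page, char **start, off_t off,
                            int count, int *eof, void *data)
{
        int len = sprintf(page, "foodev is alive\n");
        *eof = 1;
        return len;
}

int init_module(void)
{
        struct proc_dir_entry *ent;

        /* dynamic inumber; no peeking into proc_dir_entry internals */
        ent = create_proc_entry("foodev", 0, NULL);
        if (!ent)
                return -ENOMEM;
        ent->read_proc = foodev_read_proc;
        return 0;
}

void cleanup_module(void)
{
        /* forgetting this is exactly the oopsable bug from (d) */
        remove_proc_entry("foodev", NULL);
}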

IMO the first thing that should be done is the exorcism - knowledge of
procfs guts should be driven out of the rest of the tree. Then we will be
able to deal with procfs problems without causing massive changes in
$BIGNUM drivers. Until that is done we can't do much with procfs
proper. 
I've dropped the patch (against 2.3.22-pre2) at
ftp.math.psu.edu/pub/viro/proc-patch-22-2.gz. Feel free to look it
over/try to test it. Comments/questions/flames to [EMAIL PROTECTED]
Cheers,
Al



Re: (reiserfs) Re: journal requirements for buffer.c

1999-10-14 Thread Stephen C. Tweedie

Hi,

On Thu, 14 Oct 1999 14:31:23 +0400, Hans Reiser <[EMAIL PROTECTED]> said:

> Ah, I see, the problem is that when you batch the commits they can be
> truly huge, and they all have to commit for any of them to commit, and
> none of them can be flushed until they all commit, is that it?

Exactly.  And the worst part of it is that while the transactions are
still growing and atomic filesystem operations are still running, you
can't even tell for sure exactly how big the transaction is going to
get eventually.

--Stephen



Re: journal requirements for buffer.c

1999-10-14 Thread Hans Reiser

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Wed, 13 Oct 1999 02:19:19 +0400, Hans Reiser <[EMAIL PROTECTED]> said:
>
> > I merely hypothesize that the maximum value of required
> > FLUSHTIME_NON_EXPANDING will usually be less than 1% of memory, and
> > therefore won't have an impact.  It is not like keeping 1% of memory
> > around for use by text segments and other FLUSHTIME_NON_EXPANDING
> > buffers is likely to be a bad thing.
>
> That's probably enough for journaled filesystems, but with deferred
> allocation it definitely is not.  If you have a lot of data to commit,
> then I guess that the tree operations required to push many tens of MB
> of data to disk could well exceed that 1%.

Ah, I see, the problem is that when you batch the commits they can be truly
huge, and they all have to commit for any of them to commit, and none of them
can be flushed until they all commit, is that it?

--
Get Linux (http://www.kernel.org) plus ReiserFS
 (http://devlinux.org/namesys).  If you sell an OS or
internet appliance, buy a port of ReiserFS!  If you
need customizations and industrial grade support, we sell them.





RE: (reiserfs) Re: journal requirements for buffer.c

1999-10-14 Thread Chris Mason



> -----Original Message-----
> From: Stephen C. Tweedie [mailto:[EMAIL PROTECTED]]
> On Wed, 13 Oct 1999 09:55:39 -0400, Chris Mason
> <[EMAIL PROTECTED]> said:
>
> > All true.  But shouldn't I be able to write a function to reuse a
> > buffer_head for a different block without freeing it?  I realize the
> > buffer cache doesn't have a call to do it now, but it seems like it
> > should be possible.
>
> Sort of: you can definitely reuse the buffer data, but you almost
> certainly need a new buffer_head with which to label it when it goes to
> the log (ext3 tries to do that whenever possible).  However, there is
> still a tradeoff: if you have the buffers shared, then no new
> transaction can modify the buffer contents while a commit occurs.  If
> you allow copy-on-write so that the commit can proceed while a new
> transaction dirties the buffer, then once again you are requiring extra
> memory allocation during the commit.
>
Sorry, I don't think I'm explaining my intentions well.  Since I'm not sure
it could work, and the reserved-memory ideas solve the problem better (what
I'm thinking of would be very slow), I'll move on to other things ;-)

> However, the other part of the equation --- the fact that you don't know
> in advance how large a running transaction will become, and that its
> buffers are pinned until the transaction completes and starts to commit
> --- is much harder to work around.
>

The best we have is an upper bound, and the lack of transaction handles
makes mine less accurate.  Once I'm stable on 2.3, I'll probably add the
handles... they will make a reserved-memory setup easier to use.
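
Roughly what I have in mind - a sketch only, with made-up names, and as a
userspace toy rather than the real interface: each atomic operation opens a
handle that reserves its worst-case block count up front, so the running
transaction always carries a known upper bound.

#include <assert.h>
#include <stdio.h>

#define LOG_BLOCKS 1024         /* capacity of the (toy) log */

struct transaction {
        int reserved;           /* sum of live handles' worst cases */
        int used;               /* blocks actually dirtied so far */
};

struct handle {
        struct transaction *t;
        int remaining;          /* reservation left for this operation */
};

/* Start an atomic op.  In the kernel this would sleep until the current
 * transaction commits, instead of failing, when the worst case would
 * overflow the log (or the reserved-memory pool). */
static int trans_start(struct transaction *t, struct handle *h, int nblocks)
{
        if (t->reserved + nblocks > LOG_BLOCKS)
                return -1;
        t->reserved += nblocks;
        h->t = t;
        h->remaining = nblocks;
        return 0;
}

/* Account one dirtied block against the handle's reservation. */
static void trans_dirty(struct handle *h)
{
        assert(h->remaining > 0);
        h->remaining--;
        h->t->used++;
}

/* End the op: hand back the unused part of the worst-case estimate. */
static void trans_stop(struct handle *h)
{
        h->t->reserved -= h->remaining;
        h->remaining = 0;
}

int main(void)
{
        struct transaction t = { 0, 0 };
        struct handle h;

        if (trans_start(&t, &h, 12) == 0) {     /* worst case: 12 blocks */
                trans_dirty(&h);                /* actually dirtied: 2 */
                trans_dirty(&h);
                trans_stop(&h);
        }
        printf("reserved %d, used %d of %d\n", t.reserved, t.used, LOG_BLOCKS);
        return 0;
}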

-chris



Re: journal requirements for buffer.c

1999-10-14 Thread Stephen C. Tweedie

Hi,

On Wed, 13 Oct 1999 02:19:19 +0400, Hans Reiser <[EMAIL PROTECTED]> said:

> I merely hypothesize that the maximum value of required
> FLUSHTIME_NON_EXPANDING will usually be less than 1% of memory, and
> therefore won't have an impact.  It is not like keeping 1% of memory
> around for use by text segments and other FLUSHTIME_NON_EXPANDING
> buffers is likely to be a bad thing.

That's probably enough for journaled filesystems, but with deferred
allocation it definitely is not.  If you have a lot of data to commit,
then I guess that the tree operations required to push many tens of MB
of data to disk could well exceed that 1%.
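
To put rough, purely illustrative numbers on it: on a 128 MB box, 1% is
about 1.3 MB.  A deferred-allocation commit pushing 50 MB of data in 4 KB
blocks has to place some 12,800 blocks, and each placement can dirty
bitmap, group-descriptor and indirect/tree blocks on the way out - the
metadata alone can easily blow past that 1%.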

>> It should definitely be possible to establish a fairly clean common
>> kernel API for this.  Doing so would have the extra advantage that if
>> you had mixed ReiserFS and XFS partitions on the same machine, the
>> VM's memory reservation would be able to cope cleanly with multiple
>> users of reserved memory.

> Ok, so we agree that we need it, and the details we are still refining.

Yes.

--Stephen



RE: (reiserfs) Re: journal requirements for buffer.c

1999-10-14 Thread Stephen C. Tweedie

Hi,

On Wed, 13 Oct 1999 09:55:39 -0400, Chris Mason
<[EMAIL PROTECTED]> said:

> All true.  But shouldn't I be able to write a function to reuse a buffer_head
> for a different block without freeing it?  I realize the buffer cache
> doesn't have a call to do it now, but it seems like it should be possible.

Sort of: you can definitely reuse the buffer data, but you almost
certainly need a new buffer_head with which to label it when it goes to
the log (ext3 tries to do that whenever possible).  However, there is
still a tradeoff: if you have the buffers shared, then no new
transaction can modify the buffer contents while a commit occurs.  If
you allow copy-on-write so that the commit can proceed while a new
transaction dirties the buffer, then once again you are requiring extra
memory allocation during the commit.  

However, the other part of the equation --- the fact that you don't know
in advance how large a running transaction will become, and that its
buffers are pinned until the transaction completes and starts to commit
--- is much harder to work around.
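
For the copy-on-write variant, here is a toy sketch (made-up names; ext3
keeps a similar frozen copy internally).  Note that the allocation happens
at commit time, which is exactly the extra memory demand mentioned above.

#include <stdlib.h>
#include <string.h>

struct buffer {
        char *data;             /* live contents, still writable */
        char *frozen;           /* snapshot being committed, or NULL */
        size_t size;
};

/* Commit time: pin the current contents for the log. */
static int buffer_freeze(struct buffer *b)
{
        b->frozen = malloc(b->size);    /* may fail under memory pressure! */
        if (!b->frozen)
                return -1;
        memcpy(b->frozen, b->data, b->size);
        return 0;               /* commit writes b->frozen to the log */
}

/* The commit record has hit disk: drop the snapshot. */
static void buffer_unfreeze(struct buffer *b)
{
        free(b->frozen);
        b->frozen = NULL;
}

int main(void)
{
        char payload[4096] = "some metadata block";
        struct buffer b = { payload, NULL, sizeof(payload) };

        if (buffer_freeze(&b) == 0) {
                payload[0] = 'S';  /* a new transaction dirties the live copy */
                buffer_unfreeze(&b);
        }
        return 0;
}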

--Stephen




Announce: limited user mode tools for ext3-0.0.2

1999-10-14 Thread Stephen C. Tweedie

Hi,

To follow up on the kernel announce of the ext3-0.0.2 snapshot, there
are a couple of tools available for helping with migrating to/from ext3.
In particular, the current e2fsprogs work-in-progress snapshot at:

http://web.mit.edu/tytso/www/linux/dist/e2fsprogs-1.16-WIP.tar.gz

has support in debugfs for clearing out the ext3 journal flags to allow
you to perform an fsck.  To fsck an ext3 filesystem, you _must_ have a
filesystem which has been unmounted cleanly (or remounted read-only), as
currently only the kernel understands how to recover the journal on an
uncleanly-dismounted filesystem.  You can then use debugfs from the new
e2fsprogs to clear the "HAS_JOURNAL" flag on the filesystem:

[root@sarek /root]# debugfs
debugfs 1.16-WIP, 15-Sep-1999 for EXT2 FS 0.5b, 95/08/09
debugfs:  open -f -w /dev/sda2
debugfs:  features
Filesystem features: has_journal sparse_super
debugfs:  features -has_journal
Filesystem features: sparse_super
debugfs:  quit
[root@sarek /root]# 
   
You now have a normal ext2 filesystem, and you can e2fsck it as usual.
To remount it as ext3, simply use the same call you used to set up the
filesystem in the first place, i.e. "mount /dev/sda2 /mnt/test -o
journal=xxx". 

To allow a journal to be added to a root filesystem, a new init flag has
been added to the kernel in the ext3-0.0.2 release: you can specify
"rootflags=xxx" and the xxx will be passed to "mount" when the root
filesystem is mounted.  If you want to set up an ext3 root filesystem
with a journal on inode 1234, you can pass kernel command line
parameters "rw rootflags=journal=1234" to cause the kernel to mount root
read-write as ext3 and create the specified journal.
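
For example, with LILO (and the journal-on-inode-1234 setup from above;
adjust to taste), the whole thing fits in a lilo.conf stanza:

image = /boot/vmlinuz-ext3
        label = ext3
        read-write
        append = "rootflags=journal=1234"

or you can simply type "ext3 rw rootflags=journal=1234" at the boot prompt.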

More and better documentation is on the wish-list for ext3-0.0.3. 

--Stephen



Announce: ext2+journaling, release 0.0.2

1999-10-14 Thread Stephen C. Tweedie

Hi all,

OK, a couple of weeks later than I'd hoped, and massive numbers of
bug-fixes further on, ext3-0.0.2 is out.  

This is the first usable release.  Apart from critical-failure
handling (IO errors and memory-allocation failures), this is
the first solid version of journaled ext2.  

The on-disk format is not yet finalised: there will be format changes to
the journal in the future, but you will always have an upgrade migration
path involving backing off from ext3 and then re-upgrading to the new
ext3.  So go out and hammer on it.  (No, *not* on your production web
server --- not just yet!)

Find it at

ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ext3-0.0.2.tar.gz

Cheers,
 Stephen



Changes in this release
-----------------------

Bug fixes.  Lots of bug fixes.  Buckets of them.

It works on >1K blocksize filesystems.  It recovers reliably.  It
survives log wraps properly during recovery.  mknod() works properly: it
will no longer turn /dev into a socket if used on your root filesystem.  

This one survives under load quite happily.  A 50-client dbench run
completes reliably.

So basically, this is the first usable ext3 release.

Note that there are two major places where the implementation is not
complete: clean handling of all errors (in particular out-of-memory and
IO errors), and performance (there is still a lot of debugging code in
place, and all data is journaled as part of the testing cycle).  But it
is usable: I've been running it on all of my laptop's filesystems for
over a week now.





Re: [patch] [possible race in ext2] Re: how to write get_block?

1999-10-14 Thread tytso

   From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
   Date:   Mon, 11 Oct 1999 17:34:36 +0100 (BST)

   The _fast_ quick fix is to maintain a per-inode list of dirty buffers
   and to invalidate that list when we do a delete.  This works for
   directories if we only support truncate back to zero --- it obviously
   gets things wrong if we allow partial truncates of directories (but why
   would anyone want to allow that?!)

   This would have minimal performance implication and would also allow
   fast fsync() of indirect block metadata for regular files.

I've actually had patches to do fast fsync for some time, but I
thought you said you had some changes in the queue which would make
this obsolete, so I didn't bother to submit it, since it is a
bit of a hack.  Here it is though, for folks to comment on.  Code to
invalidate the list of dirty buffers is missing from this patch, but it
wouldn't be hard to add.
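
For what it's worth, a minimal sketch of that missing invalidation, in
terms of the i_ffsync fields the patch introduces (this is not code from
the patch) - something like this, called wherever blocks are freed, e.g.
from truncate:

/* Force the next fsync down the full-scan path; sync_inode()
 * re-arms the fast path once the inode has been fully synced. */
static inline void ext2_ffsync_invalidate(struct inode *inode)
{
        inode->u.ext2_i.i_ffsync_flag = 0;
        inode->u.ext2_i.i_ffsync_ptr = 0;
}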

- Ted

Patch generated: on Wed Aug 18 15:01:49 EDT 1999 by [EMAIL PROTECTED]
against Linux version 2.2.10
 
===
RCS file: fs/ext2/RCS/fsync.c,v
retrieving revision 1.1
diff -u -r1.1 fs/ext2/fsync.c
--- fs/ext2/fsync.c 1999/07/21 10:45:22 1.1
+++ fs/ext2/fsync.c 1999/07/21 10:45:26
@@ -250,36 +250,83 @@
 	return err;
 }
 
-/*
- * File may be NULL when we are called. Perhaps we shouldn't
- * even pass file to fsync ?
- */
-
-int ext2_sync_file(struct file * file, struct dentry *dentry)
+static int sync_inode(struct inode *inode)
 {
-	int wait, err = 0;
-	struct inode *inode = dentry->d_inode;
-
+	int i, wait, err = 0;
+	__u32 *blklist;
+
 	if (S_ISLNK(inode->i_mode) && !(inode->i_blocks))
-		/*
-		 * Don't sync fast links!
-		 */
-		goto skip;
+		return 0;
 
-	for (wait=0; wait<=1; wait++)
-	{
-		err |= sync_direct (inode, wait);
-		err |= sync_indirect (inode,
+	if (!S_ISDIR(inode->i_mode) && inode->u.ext2_i.i_ffsync_flag) {
+		blklist = inode->u.ext2_i.i_ffsync_blklist;
+		for (wait = 0; wait <= 1; wait++) {
+			for (i = 0; i < inode->u.ext2_i.i_ffsync_ptr; i++) {
+#if 0	/* Debugging */
+				if (!wait)
+					printk("Fast sync: %d\n", blklist[i]);
+#endif
+				err |= sync_block(inode, &blklist[i], wait);
+			}
+		}
+	} else {
+		for (wait = 0; wait <= 1; wait++) {
+			err |= sync_direct (inode, wait);
+			err |= sync_indirect (inode,
 			      inode->u.ext2_i.i_data+EXT2_IND_BLOCK,
-			      wait);
-		err |= sync_dindirect (inode,
+			      wait);
+			err |= sync_dindirect (inode,
 			       inode->u.ext2_i.i_data+EXT2_DIND_BLOCK,
-			       wait);
-		err |= sync_tindirect (inode,
+			       wait);
+			err |= sync_tindirect (inode,
 			       inode->u.ext2_i.i_data+EXT2_TIND_BLOCK,
-			       wait);
+			       wait);
+		}
 	}
-skip:
+	inode->u.ext2_i.i_ffsync_flag = 1;
+	inode->u.ext2_i.i_ffsync_ptr = 0;
 	err |= ext2_sync_inode (inode);
+	return err;
+}
+
+/*
+ * File may be NULL when we are called by msync on a vma.  In the
+ * future, the VFS layer should be changed to not pass the struct file
+ * parameter to the fsync function, since it's not used by any of the
+ * implementations (and the dentry parameter is all that we need).
+ */
+int ext2_sync_file(struct file * file, struct dentry *dentry)
+{
+	int err = 0;
+
+	err = sync_inode(dentry->d_inode);
+	if (dentry->d_parent && dentry->d_parent->d_inode)
+		err |= sync_inode(dentry->d_parent->d_inode);
+
 	return err ? -EIO : 0;
+}
+
+/*
+ * This function adds a list of blocks to be written out by fsync.  If
+ * it exceeds NUM_EXT2_FFSYNC_BLKS, then we turn off the fast fsync flag.
+ */
+void ext2_ffsync_add_blk(struct inode *inode, __u32 blk)
+{
+	int i;
+	__u32 *blklist;
+
+	if (inode->u.ext2_i.i_ffsync_flag == 0)
+		return;
+#if 0	/* Debugging */
+	printk("Add fast sync: %d\n", blk);
+#endif
+	blklist = inode->u.ext2_i.i_ffsync_blklist;
+	for (i = 0; i < inode->u.ext2_i.i_ffsync_ptr; i++)
+		if (blklist[i] == blk)
+			return;
+	if (inode->u.ext2_i.i_f