Re: another ext3 question

2000-08-13 Thread Stephen C. Tweedie

Hi,

Sorry for the delay, I've been on holiday for a couple of weeks.

On Thu, Jul 27, 2000 at 07:36:34PM -0400, Jeremy Hansen wrote:
 
 ok ... to clarify ... ext3 _guarantees_ consistent file system metadata
 or empirically, it tends to be robust about maintaining consistent
 file system metadata across abrupt reboots?

It guarantees it --- all fs data (including things like quota data)
are guaranteed consistent across reboots without fsck.

--Stephen



Re: another ext3 question

2000-07-27 Thread Stephen C. Tweedie

Hi,

On Fri, Jul 21, 2000 at 11:54:20PM -0600, Andreas Dilger wrote:

 Note that you should not make the journals so large that they are a
 major fraction of your RAM, as you will not gain anything by this.
 A few megabytes is fine, 1024 disk blocks is the minimum.

Yep.  The main drawbacks of a large journal are that (a) it can pin a
lot of buffers in memory at once, and (b) it takes longer to recover.
The only advantage of a large journal is that it gives the filesystem
more flexibility in writing things back to the main disk, but it's not
a large effect unless you have a very heavy write load.

Cheers,
 Stephen



Re: another ext3 question

2000-07-27 Thread Stephen C. Tweedie

Hi,

On Thu, Jul 27, 2000 at 01:41:54PM -0400, Jeremy Hansen wrote:
 
 We're really itching to use ext3 in a production environment.  Can you
 give any clues on how things are going?

The ext3-0.0.2f release appears to be rock solid.  Andreas has prototype
code for e2fsck log replay, and I've got pending code for
out-of-memory and IO failures sitting here, along with most of the
code for metadata-only journaling.  I'm off on holiday in a few hours
for 2 weeks, but expect a release shortly after I get back with some
more goodies in it.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote:
 
 Here is one more for you:
   Suppose we grow the last fragment/tail/whatever. Do you copy the
 data out of that shared block? If so, how do you update buffer_heads in
 pages that cover the relocated data? (Same goes for reiserfs, if they are
 doing something similar). BTW, our implementation of UFS is fucked up in
 that respect, so variant from there will not work.

For tail writes, I'd imagine we would just end up using the page cache
as a virtual cache as NFS uses it, and doing plain copy into the
buffer cache pages.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote:
 
 Not. Data normally is in page. Buffer_heads are not included into buffer
 cache. They are referred to from the struct page and their ->b_data just
 points to appropriate pieces of the page. You cannot get them via bread().
 At all. Buffer cache is only for metadata.

Only in the default usage.  There's no reason at all why we can't use
separate buffer and page cache aliases of the same data for tails as a
special case.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote:

  For tail writes, I'd imagine we would just end up using the page cache
  as a virtual cache as NFS uses it, and doing plain copy into the
  buffer cache pages.
 
 Ouch. I _really_ don't like it - we end up with special behaviour on one
 page in the pagecache.

Correct.  But it's all inside the filesystem, so there is zero VFS
impact.  And we're talking about non-block-aligned data for tails, so
we simply don't have a choice in this case.

 And getting data migration from buffer cache to
 page cache, which is Not Nice(tm).

Not preferred for bulk data, perhaps, but the VFS should cope just
fine.

 Yuck... Besides, when do we decide that
 tail is going to be, erm, merged? What will happen with the page then?

To the page?  Nothing.  To the buffer?  It gets updated with the new
contents of disk.  Page == virtual contents.  Buffer == physical
contents.  Plain and simple.

Cheers,
 Stephen




Re: question about sard and disk profiling patches

2000-07-25 Thread Stephen C. Tweedie

Hi,

On Mon, Jul 24, 2000 at 04:34:11PM -0400, Jeremy Hansen wrote:

 I build some customized kernel rpm's for our inhouse distro and I've
 incorporated the profiling patches to use the sard utility.  I'm just
 curious if there is any downside to using this patch for any reason and if
 there are any concerns for stability using this patch?

Not really.  On really low-memory machines the extra space used by the
sard profiling tables might be significant, but you'd have to be
running small embedded systems to notice that.  I have had no reports
of any sort of stability problems with any of the recent sard patches.

Cheers,
 Stephen



Re: PATCH: WIP super lock contention mitigation in 2.4

2000-07-14 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 12, 2000 at 07:12:39PM +0200, Juan J. Quintela wrote:
 
 Hi
 I have ported Stephen's patch from 2.2.8 to 2.4, but now I have
 contention in the inode bitmaps.  Can I use the same trick
 there that he used in the block bitmap???

I've attached the patch I did for 2.3.99-pre7.  It uses a much simpler
trick for inodes: it just drops the lock while waiting in the
read_inode_bitmap functions and retakes it after IO.  That means that
the lock is still held for the duration of the allocation of the
inode, but that's a much rarer and also a much more lightweight
operation than block allocation, so it should work fine.
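
The same drop-retake-recheck pattern, modelled in user space (the names,
the malloc/usleep stand-ins and the single cached slot below are invented
for illustration; the real code juggles the ext2 superblock lock and
buffer_heads):

/*
 * User-space model of the trick above: drop the lock around the slow
 * bitmap read, retake it, and re-check before installing the result.
 */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t super_lock = PTHREAD_MUTEX_INITIALIZER;
static void *cached_bitmap;             /* stands in for the cached buffer */

static void *load_bitmap_slow(void)     /* stands in for bread() doing IO */
{
        usleep(1000);
        return malloc(64);
}

void *get_bitmap(void)
{
        void *bh = NULL;

        pthread_mutex_lock(&super_lock);
        while (!cached_bitmap) {
                pthread_mutex_unlock(&super_lock);
                bh = load_bitmap_slow();        /* lock dropped while we sleep */
                pthread_mutex_lock(&super_lock);

                if (cached_bitmap) {            /* somebody beat us to it */
                        free(bh);
                        break;
                }
                cached_bitmap = bh;
        }
        bh = cached_bitmap;
        pthread_mutex_unlock(&super_lock);
        return bh;
}

int main(void)
{
        return get_bitmap() ? 0 : 1;
}

The point is the re-check after the lock is retaken: another thread may
have loaded the bitmap while we slept, in which case we discard our copy.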

Cheers,
 Stephen


--- linux-2.3.99/fs/ext2/balloc.c.~1~   Wed Mar 29 22:35:22 2000
+++ linux-2.3.99/fs/ext2/balloc.c   Wed May  3 16:49:04 2000
@@ -9,6 +9,13 @@
  *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
  *  Big-endian to little-endian byte-swapping/bitmaps by
  *David S. Miller ([EMAIL PROTECTED]), 1995
+ *
+ *  Dropped use of the superblock lock.
+ *  This means that we have to take extra care not to block between any
+ *  operations on the bitmap and the corresponding update to the group 
+ *  descriptors.
+ *   Stephen C. Tweedie ([EMAIL PROTECTED]), 1999
+ * 
  */
 
 #include <linux/config.h>
@@ -16,7 +23,6 @@
 #include <linux/locks.h>
 #include <linux/quotaops.h>
 
-
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -70,42 +76,33 @@
 }
 
 /*
- * Read the bitmap for a given block_group, reading into the specified 
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group.
  *
- * Return >=0 on success or a -ve error code.
+ * Return the buffer_head on success or NULL on IO error.
  */
 
-static int read_block_bitmap (struct super_block * sb,
-  unsigned int block_group,
-  unsigned long bitmap_nr)
+static struct buffer_head * read_block_bitmap (struct super_block * sb,
+unsigned int block_group)
 {
struct ext2_group_desc * gdp;
struct buffer_head * bh = NULL;
-   int retval = -EIO;

gdp = ext2_get_group_desc (sb, block_group, NULL);
if (!gdp)
-   goto error_out;
-   retval = 0;
+   return NULL;
+
bh = bread (sb->s_dev, le32_to_cpu(gdp->bg_block_bitmap), sb->s_blocksize);
if (!bh) {
ext2_error (sb, "read_block_bitmap",
"Cannot read block bitmap - "
"block_group = %d, block_bitmap = %lu",
block_group, (unsigned long) gdp->bg_block_bitmap);
-   retval = -EIO;
+   return NULL;
}
-   /*
-* On IO error, just leave a zero in the superblock's block pointer for
-* this group.  The IO will be retried next time.
-*/
-error_out:
sb->u.ext2_sb.s_block_bitmap_number[bitmap_nr] = block_group;
sb->u.ext2_sb.s_block_bitmap[bitmap_nr] = bh;
-   return retval;
+   return bh;
 }
 
+
 /*
  * load_block_bitmap loads the block bitmap for a blocks group
  *
@@ -122,9 +119,9 @@
 static int __load_block_bitmap (struct super_block * sb,
unsigned int block_group)
 {
-   int i, j, retval = 0;
+   int i, j;
unsigned long block_bitmap_number;
-   struct buffer_head * block_bitmap;
+   struct buffer_head * block_bitmap, *cached_bitmap = NULL;
 
if (block_group >= sb->u.ext2_sb.s_groups_count)
ext2_panic (sb, "load_block_bitmap",
@@ -132,27 +129,34 @@
"block_group = %d, groups_count = %lu",
block_group, sb->u.ext2_sb.s_groups_count);

+ retry:
if (sb->u.ext2_sb.s_groups_count <= EXT2_MAX_GROUP_LOADED) {
if (sb->u.ext2_sb.s_block_bitmap[block_group]) {
if (sb->u.ext2_sb.s_block_bitmap_number[block_group] ==
-   block_group)
+   block_group) {
+   brelse(cached_bitmap);
return block_group;
+   }
ext2_error (sb, "__load_block_bitmap",
"block_group != block_bitmap_number");
}
-   retval = read_block_bitmap (sb, block_group, block_group);
-   if (retval < 0)
-   return retval;
+   if (!cached_bitmap)
+   goto load;
+   sb->u.ext2_sb.s_block_bitmap_number[block_group] = block_group;
+   sb->u.ext2_sb.s_block_bitmap[block_group] = cached_bitmap;
return block_group;
}
 
for (i = 0; i < sb->u.ext2_sb.s_loaded_block_bitmaps 
 

Re: PATCH: Trying to get back IO performance (WIP)

2000-07-03 Thread Stephen C. Tweedie

Hi,

On Mon, Jul 03, 2000 at 02:24:07AM +0200, Juan J. Quintela wrote:

 This patch is against test3-pre2.
 It gives here good performance in the first run, and very bad
 in the following ones of dbench 48.  I am hitting here problems with
 the locking scheme.  I get a lot of contention in __wait_on_super.
 Almost all the dbench processes are waiting in:
 
0xc013639c __wait_on_super+0x184 (0xc13f4c00)
0xc01523e5 ext2_alloc_block+0x21 (0xc4840c20, 0x12901d, 0xc7427ea0)

Known, and I did a patch for this ages ago.  It actually didn't make a
whole lot of difference.  The last version of the ext2 diffs I did for
this are included below.

--Stephen



From: "Stephen C. Tweedie" [EMAIL PROTECTED]
To: Linus Torvalds [EMAIL PROTECTED]
Cc: Ingo Molnar [EMAIL PROTECTED], "Stephen C. Tweedie" [EMAIL PROTECTED],
[EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: DANGER: DONT apply it. Re: [patch] ext2nolock-2.2.8-A0
Date: Sat, 15 May 1999 03:05:41 +0100 (BST)

Hi,

Linus Torvalds writes:

  In particular, ext2_new_block() returns a block number. That's all fine
  and dandy, but it's extremely and utterly idiotic. It means that
  ext2_new_block() needs to clear the block, which is where all the
  race-avoidance comes from as far as I can tell. 

There are two sets of races involved.  One involves consistency of the
bitmaps themselves: we need to make sure that concurrent allocations
are consistent.  The second is consistency of inodes, and that is
already handled in fs/ext2/inode.c: when we call ext2_alloc_block(),
we follow it up with a check to see if anybody else has allocated the
same block in the same inode while we slept.  If so, we free the block
and repeat the block search.

So, there's no reason why block clearing needs to be locked: the block
cannot be installed in the inode indirection maps until we have
completely returned from ext2_new_block() anyway.

Anyway, the patch below (against 2.2.8) is an extension of the last
one I posted, and it gets rid of the superblock lock in balloc.c
entirely.  We do have to be a bit more careful about allocation
consistency without the lock: in particular, quota operations used to
happen between bitmap and group descriptor updates, but now we can't
risk blocking while the bitmaps and group descriptors are
inconsistent.

It compiles but I can't test right now.  Feel free to play with it if
you want to, but I do believe we can safely go without superblock
locks in both ialloc.c and balloc.c if we are careful.

The main change that dropping the lock imposes on us is that we cannot
rely on the bitmap buffer remaining pinned in cache for the duration
of the allocation, so we have to bump the buffer_head b_count in
load_block_bitmap() and brelse it once we are finished with it.
Shift-ScrollLock will be very useful here for spotting the buffer_head
leaks I've introduced. :)

--Stephen

alloc-2.2.8.diff:

--- fs/ext2/balloc.c.~1~Thu Oct 29 05:54:56 1998
+++ fs/ext2/balloc.cFri May 14 14:46:59 1999
@@ -9,6 +9,13 @@
  *  Enhanced block allocation by Stephen Tweedie ([EMAIL PROTECTED]), 1993
  *  Big-endian to little-endian byte-swapping/bitmaps by
  *David S. Miller ([EMAIL PROTECTED]), 1995
+ *
+ *  Dropped use of the superblock lock.
+ *  This means that we have to take extra care not to block between any
+ *  operations on the bitmap and the corresponding update to the group 
+ *  descriptors.
+ *   Stephen C. Tweedie ([EMAIL PROTECTED]), 1999
+ * 
  */
 
 /*
@@ -74,42 +81,33 @@
 }
 
 /*
- * Read the bitmap for a given block_group, reading into the specified 
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group.
  *
- * Return >=0 on success or a -ve error code.
+ * Return the buffer_head on success or NULL on IO error.
  */
 
-static int read_block_bitmap (struct super_block * sb,
-  unsigned int block_group,
-  unsigned long bitmap_nr)
+static struct buffer_head * read_block_bitmap (struct super_block * sb,
+  

O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Stephen C. Tweedie

Hi all,

The following patch fully implements O_SYNC, fsync and fdatasync,
at least for ext2.  The infrastructure it includes should make it
trivial for any other filesystem to do likewise.

The basic changes are:

Include a per-inode list of dirty buffers

Pass a "datasync" parameter down to the filesystems when fsync
or fdatasync are called, to distinguish between the two (when
fdatasync is specified, we don't have to flush the inode to disk
if only timestamps have changed)

Split I_DIRTY into two bits, one (I_DIRTY_SYNC) which is set
for all dirty inodes, and the other (I_DIRTY_DATASYNC) which 
is set only if fdatasync needs to flush the inode (ie. it is
set for everything except for timestamp updates).  This means:

The old (flags & I_DIRTY) construct still returns 
true if the inode is in any way dirty; and

(flags |= I_DIRTY) sets both bits, as expected (see the
sketch after this list).

fs/ext2 and __block_commit_write are modified to record all
newly dirtied buffers (both data and metadata) on the
inode's dirty block list

generic_file_write now honours the O_SYNC flag and calls
generic_osync_inode(), which flushes the inode dirty buffer
list and calls the inode's fsync method.
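
For reference, a minimal sketch of the bit split (the numeric values are
illustrative only, not necessarily those used in the patch):

/* Illustrative values only -- the patch may use different ones. */
#define I_DIRTY_SYNC            1       /* set whenever the inode is dirty */
#define I_DIRTY_DATASYNC        2       /* set unless only timestamps changed */
#define I_DIRTY                 (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

/*
 * (flags & I_DIRTY) is true for any dirty inode, as before;
 * (flags |= I_DIRTY) sets both bits;
 * a pure timestamp update sets only I_DIRTY_SYNC, so fdatasync can
 * skip writing the inode in that case.
 */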

Note: currently, the O_SYNC code in generic_file_write calls 
generic_osync_inode with datasync==1, which means that O_SYNC is
interpreted as O_DSYNC according to the SUS spec.  In other words,
O_SYNC is not guaranteed to flush timestamp updates to disk (but
fsync is).  This is important: we do not currently have an O_DSYNC
flag (although that would now be trivial to implement), so existing
apps are forced to use O_SYNC instead.  Apps such as Oracle rely on
O_SYNC for write ordering, but due to a 2.2 bug, existing kernels
don't do the timestamp update and hence we achieve decent 
performance even without O_DSYNC.  We cannot suddenly cause all of
those applications to experience a massive performance drop.

One way round this would be to split O_SYNC into O_DSYNC and
O_TRUESYNC, and in glibc to redefine O_SYNC to be (O_DSYNC |
O_TRUESYNC).  If we keep the new O_DSYNC to have the same value
as the old O_SYNC, then:

* Old applications which specified O_SYNC will continue
  to get their expected (O_DSYNC) behaviour

* New applications can specify O_SYNC or O_DSYNC and get
  the selected behaviour on new kernels

* New applications calling either O_SYNC or O_DSYNC will
  still get O_SYNC on old kernels.

In performance testing, "dd" with 64k blocks and writing into an 
existing, preallocated file, gets close to theoretical disk bandwidth
(about 13MB/sec on a Cheetah), when using O_SYNC or when doing a
fdatasync between each write.  Doing fsync instead gives only about
3MB/sec and results in a lot of audible disk seeking, as expected.
If I don't preallocate the file, then even fdatasync is slow, as it
now has to sync the changed i_size information after every write (and
it gets slower as the file grows and the distance between the inode 
and the data being written increases).
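
A rough user-space analogue of that test, assuming a preallocated file at
an arbitrary path (the path, chunk count and 64k chunk size are all
placeholders):

/* Write 64k chunks to a preallocated file, either through an O_SYNC
 * descriptor or with an explicit fdatasync() after each write. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv)
{
        int use_osync = (argc > 1 && strcmp(argv[1], "osync") == 0);
        int flags = O_WRONLY | (use_osync ? O_SYNC : 0);
        char *buf = calloc(1, CHUNK);
        int fd = open("/tmp/prealloc.dat", flags);
        int i;

        if (fd < 0 || !buf) {
                perror("setup");
                return 1;
        }
        for (i = 0; i < 1024; i++) {            /* 64MB of sequential writes */
                if (write(fd, buf, CHUNK) != CHUNK) {
                        perror("write");
                        return 1;
                }
                if (!use_osync)
                        fdatasync(fd);          /* sync explicitly instead */
        }
        close(fd);
        free(buf);
        return 0;
}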

--Stephen


--- linux-2.4.0-test1-ac11.osync/fs/block_dev.c.~1~ Fri Jun  9 18:08:09 2000
+++ linux-2.4.0-test1-ac11.osync/fs/block_dev.c Fri Jun  9 18:08:18 2000
@@ -313,7 +313,7 @@
  * since the vma has no handle.
  */
  
-static int block_fsync(struct file *filp, struct dentry *dentry)
+static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
return fsync_dev(dentry->d_inode->i_rdev);
 }
--- linux-2.4.0-test1-ac11.osync/fs/buffer.c.~1~Fri Jun  9 18:08:09 2000
+++ linux-2.4.0-test1-ac11.osync/fs/buffer.cFri Jun  9 18:08:18 2000
@@ -68,6 +68,8 @@
 * lru_list_lock > hash_table_lock > free_list_lock > unused_list_lock
  */
 
+#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_inode_buffers)
+
 /*
  * Hash table gook..
  */
@@ -323,7 +325,7 @@
  * filp may be NULL if called via the msync of a vma.
  */
  
-int file_fsync(struct file *filp, struct dentry *dentry)
+int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
struct inode * inode = dentry->d_inode;
struct super_block * sb;
@@ -332,7 +334,7 @@
 
lock_kernel();
/* sync the inode to buffers */
-   write_inode_now(inode);
+   write_inode_now(inode, 0);
 
/* sync the superblock to buffers */
sb = inode->i_sb;
@@ -373,7 +375,7 @@
 
/* We need to protect against concurrent writers.. */
down(&inode->i_sem);
-   err = file->f_op->fsync(file, dentry);
+   err = file->f_op->fsync(file, dentry, 0);
up(&inode->i_sem);
 
 out_putf:
@@ -406,9 +408,8 @@
if (!file->f_op || !file->f_op->fsync)
goto out_putf;
 
-   /* this needs further work, at the moment it is identical to fsync() */
down(&inode->i_sem);
-   err = file->f_op->fsync(file, 

Re: O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Stephen C. Tweedie

Hi,

On Fri, Jun 09, 2000 at 02:53:19PM -0700, Ulrich Drepper wrote:
 
  If I don't preallocate the file, then even fdatasync is slow, [...]
 
 This might be a good argument to implement posix_fallocate() in the
 kernel.

No.  If we do posix_fallocate(), then there are only two choices:
we either pre-zero the file contents (in which case we are as well
doing it from user space), or we record in the inode that the file
isn't pre-zeroed and so optimise things.

If we do that optimisation, then doing an O_DSYNC write to the 
already-allocated file will have to record in the inode that we are
pushing forward the non-prezeroed fencepost in the file, so we end
up having to seek back to the inode for each write anyway, so we
lose any possible benefit during the writes.

Once you have a database file written and preallocated, this is all
academic since all further writes will be in place and so will be 
fast with the O_DSYNC/fdatasync support.

Cheers,
 Stephen



Re: O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Stephen C. Tweedie

Hi,

On Fri, Jun 09, 2000 at 02:51:18PM -0700, Ulrich Drepper wrote:
 
 Have you thought about O_RSYNC and whether it is possible/useful to
 support it separately?

It would be possible and useful, but it's entirely separate from the
write path and probably doesn't make sense until we've got O_DIRECT
working (O_RSYNC is closely related to that).

Cheers,
 Stephen



Re: [prepatch] Directory Notification

2000-05-22 Thread Stephen C. Tweedie

Hi,

On Sun, May 21, 2000 at 04:27:29PM +, Ton Hospel wrote:
  
  It delivers a realtime signal to tasks which have requested it.  The task
  can then call fstat to find out what changed.
  
 A poll() notification mechanism should be at least as useful for e.g.
 GUI's who generally prefer to have synchronous notification.

Check "sigtimedwait()".  You can use rtsignals to provide synchronous
notification if you want.  It's hard to use poll() to provide async
notification, however.
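
A small sketch of the synchronous style (how the realtime signal gets
attached to directory events is not shown here, and SIGRTMIN is just a
convenient choice of signal for the example):

/* Consume a queued realtime signal synchronously with sigtimedwait()
 * rather than in an async handler. */
#include <signal.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        sigset_t set;
        siginfo_t info;
        struct timespec timeout = { 5, 0 };     /* periodic wakeup as well */

        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        sigprocmask(SIG_BLOCK, &set, NULL);     /* must block it to dequeue it */

        for (;;) {
                int sig = sigtimedwait(&set, &info, &timeout);
                if (sig > 0)
                        printf("got notification signal %d\n", sig);
                /* sig < 0 with errno == EAGAIN just means the timeout
                   expired; either way the task can now call fstat() on
                   whatever it is watching. */
        }
}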

--Stephen





Re: [linux-audio-dev] Re: File writes with O_SYNC slow

2000-04-20 Thread Stephen C. Tweedie

Hi,

On Thu, Apr 20, 2000 at 10:57:15AM +0200, Benno Senoner wrote:
 
 I tried all combinations using my hdtest.c which I posted yesterday.
 
 I tried O_SYNC and even O_DSYNC on the SGI (Origin 2k),
 (D_SYNC syncs only data blocks but not metadata blocks)

Not quite.  O_DSYNC syncs metadata too.  The only thing it skips is
inode timestamps.  

There is an important difference between the two when you are overwriting
an existing allocated file.  In that case, there are no metadata changes
except for timestamp updates, so O_DSYNC is very much faster.  However,
if you are appending to a file, then some metadata updates (for file
mapping information and for the file size) are necessary for both 
O_SYNC and O_DSYNC.
 
 write() + fsync()/fdatasync() on linux doesn't work well too since the kernel
 isn't able to optimize disk writing by using the elevator algorithm.

There are other problems with fsync/fdatasync in 2.2.  In particular,
it is slow for larger files since it tries to scan all the mapping
information for the entire file.

I'll put together an old patch I did to make fsync/fdatasync and
O_DSYNC work much faster.  It will be interesting to see if it makes
much difference, and it may be the stick we need to beat Linus into
believing that this change is really quite important.

--Stephen



Re: File writes with O_SYNC slow

2000-04-19 Thread Stephen C. Tweedie

Hi,

On Wed, Apr 19, 2000 at 11:55:04AM -0400, Karl JH Millar wrote:
 
 I've noticed that file writes with O_SYNC are very much slower than they should
 be.

How fast do you think they should be?

If you are doing small appends, then O_SYNC is _guaranteed_ to be dead
slow.  Every write involves updating the data, updating the mapping tree,
and flushing the whole lot synchronously to disk.  That's two seeks per
write.

--Stephen



O_DIRECT architecture (was Re: info point on linux hdr)

2000-04-18 Thread Stephen C. Tweedie

Hi,

On Mon, Apr 17, 2000 at 05:58:48PM -0500, Steve Lord wrote:
 
 O_DIRECT on Linux XFS is still a work in progress, we only have
 direct reads so far. A very basic implementation was made available
 this weekend.

Care to elaborate on how you are doing O_DIRECT?

It's something I've been thinking about in the general case.  Basically
what I want to do is this:

Augment the inode operations with a new operation, "rw_kiovec" which
performs reads and writes on vectors of kiobufs.  

Provide a generic_rw_kiovec() function which uses the existing page-
oriented IO vectors to set up page mappings much as generic_file_{read,
write} do, but honouring the following flags in the file descriptor:

 * O_ALIAS
   
   Allows the write function to install the page in the kiobuf 
   into the page cache if the data is correctly aligned and there is
   not already a page in the page cache.

   For read, the meaning is different: it allows existing pages in 
   the page cache to be installed into the kiobuf.

 * O_UNCACHE

   If the IO created a new page in the page cache, then attempt to
   unlink the page after the IO completes.

 * O_SYNC

   Usual meaning: wait for synchronous write IO completion.

O_DIRECT becomes no more than a combination of these options.
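
Purely as an illustration of that composition (O_ALIAS and O_UNCACHE are
proposals from the text above, not real kernel flags, so the values here
are invented):

#include <fcntl.h>

#define O_ALIAS_PROPOSED        0x01000000      /* invented value */
#define O_UNCACHE_PROPOSED      0x02000000      /* invented value */
#define O_DIRECT_PROPOSED       (O_ALIAS_PROPOSED | O_UNCACHE_PROPOSED | O_SYNC)

int open_direct(const char *path)
{
        /* "O_DIRECT becomes no more than a combination of these options." */
        return open(path, O_RDWR | O_DIRECT_PROPOSED);
}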

Furthermore, by implementing this mechanism with kiobufs, we can go
one step further and perform things like Larry's splice operations by
performing reads and writes in kiobufs.  Using O_ALIAS kiobuf reads and
writes gives us copies between regular files entirely in kernel space
with the minimum possible memory copies.  sendfile() between regular
files can be optimised to use this mechanism.  The data never has to
hit user space.

As an example of the flexibility of the interface, you can perform
an O_ALIAS, O_UNCACHE sendfile to copy one file to another, with full
readahead still being performed on the input file but with no memory 
copies at all.  You can also choose not to have O_UNCACHE and O_SYNC
on the writes, in which case you have both readahead and writebehind
with zero copy.

This is all fairly easy to implement (at least for ext2), and gives
us much more than just O_DIRECT for no extra work.

--Stephen



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-18 Thread Stephen C. Tweedie

Hi,

On Mon, Apr 17, 2000 at 07:10:43PM +0200, Martin Schenk wrote:

 If you are interested in a more efficient fsync (and a real fdatasync),
 I have some patches that provide better performance for very large
 files (where fsync is mostly busy scanning the page cache for changes),
 and a fdatasync that eliminates writing the inode if not necessary.
 (at the moment these patches are only for 2.3.4?, and I don't have
 the time to keep them up to date - especially as nobody was interested
 the last time I posted them)

Please post them, I will have a look at them.

--Stephen



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-18 Thread Stephen C. Tweedie

Hi,

On Tue, Apr 18, 2000 at 10:57:25AM -0400, Paul Barton-Davis wrote:
  1) pre-allocation takes a *long* time. Allocating 24 203MB files on a
 clean ext2 partition of 18GB takes many, many minutes, for example.
 Presumably, the same overhead is being incurred when block
 allocation happens "on the fly".
 
 It is not the allocation which is taking ages, it's the actual
 writing of the data.  
 
 Except that for preallocation, I only write one byte in every block,
 so for a 203MB file, I only write 52K approximately (ext2 4K blocks).

Umm, how is that going to make _any_ difference at all?  The filesystem
works in blocks, not bytes.  You still end up with 203MB of disk IO.

--Stephen



Re: O_DIRECT architecture (was Re: info point on linux hdr)

2000-04-18 Thread Stephen C. Tweedie

Hi,

On Tue, Apr 18, 2000 at 07:56:04AM -0500, Steve Lord wrote:
 
 XFS is using the pagebuf code we wrote (or I should say are writing - it
 needs a lot of work yet). This uses kiobufs to represent data in a set of
 pages. So, we have the infrastructure to take a kiobuf and read or write
 it from disk (OK, it uses buffer heads under the covers).

That's fine, and in fact is exactly what kiobufs were designed for:
to abstract out the storage of the buffer from whatever construction
you happen to use to do the IO.  (Raw IO also uses buffer_heads 
internally but passes data around in kiobufs.)

 I said basic implementation because it is currently paying no attention
 to cached data. The Irix approach to this was to flush or toss cached
 data which overlapped a direct I/O, I am leaning towards keeping them
 as part of the I/O.

The big advantage of the scheme where I map the kiobuf pages into the
real page cache before the I/O, and unmap after, is that cache
coherency at the beginning of the I/O and all the way through it is
guaranteed.  The cost is that the direct I/O may end up doing copies
if there is other I/O going on at the same time to the same page, but
I don't see that as a problem!

   o using caching to remove the alignment restrictions on direct I/O by
 doing unaligned head and tail processing via buffered I/O.

I'm just planning on doing a copy for any unaligned I/O.  Raw character
devices simply reject unaligned I/O for now, but O_DIRECT will be a 
bit more forgiving.

  It's something I've been thinking about in the general case.  Basically
  what I want to do is this:
  
  Augment the inode operations with a new operation, "rw_kiovec" which
  performs reads and writes on vectors of kiobufs.  
 
 You should probably take a look at what we have been doing to the ops,
 although our extensions are really biased towards extent based filesystems,
 rather than using getblock to identify individual blocks of file data we
 added a bmap interface to return a larger range - this requires different
 locking semantics than getblock, since the mapping we return covers multiple
 pages. I suspect that any approach which assembles multiple pages in advance
 is going to have similar issues.

OK.  These are probably orthogonal for now, but doing extent bmaps is
an important optimisation.  

Ultimately we are going to have to review the whole device driver 
interface.  We need that both to do things like 2TB block devices, and
also to achieve better efficiency than we can attain right now with a
separate buffer_head for every single block in the I/O.  It's just using
too much CPU; being able to pass kiobufs directly to ll_rw_block along
with a block address list would be much more efficient.

 So if O_ALIAS allows user pages to be put in the cache (provided you use
 O_UNCACHE with it), you can do this.

Yes.

 However, O_DIRECT would be a bit more
 than this - since if there already was cached data for part of the I/O
 you still need to copy those pages up into the user pages which did not
 get into cache. 

That's the intention --- O_ALIAS _allows_ the user page to be mapped 
into the cache, but if existing cached data or alignment constraints
prevent that, it will fall back to doing a copy.

One consequence is that O_DIRECT I/O from a file which is already cached
will always result in copies, but I don't mind that too much.

 We (SGI) really need to get better hooked in on stuff like this - I really
 don't want to see us going off in one direction (pagebuf) and all the other
 filesystems going off in a different direction.

The pagebuf stuff sounds like it is fairly specialised for now.  As
long as all of the components that we are talking about can pass kiobufs
between themselves, we should be able to make them interoperate pretty
easily.

Is the pagebuf code intended to be core VFS functionality or do you
see it being an XFS library component for the forseeable future?
 
 p.s. did you know we also cache meta data in pages directly?

That was one of the intentions in the new page cache structure, and we
may actually end up moving ext2's metadata caching to use the page 
cache too in the future.  

--Stephen



Re: O_DIRECT architecture (was Re: info point on linux hdr)

2000-04-18 Thread Stephen C. Tweedie

Hi,

On Tue, Apr 18, 2000 at 01:17:52PM -0500, Steve Lord wrote:

 So I guess the question here is how do you plan on keeping track of the
 origin of the pages?

You don't have to.

 Which ones were originally part of the kernel cache
 and thus need copying up to user space?

If the caller requested O_ALIAS, then the IO routine is allowed to
free any page in the kiobuf and alias the existing page cache page into
the kiobuf.  It doesn't actually matter whether the original page in
the page cache was there when we started, or whether it was a kiobuf
page which we mapped into the page cache for a direct IO.  The whole
point is that as long as the IO is in progress, it is a perfectly 
legal page in the page cache.

There is one nastiness.  This use of the kiobuf effectively results
in a temporary mmap() of the file while the IO is in progress.  If 
another thread happens to write to the page undergoing an O_DIRECT
read, then we end up modifying the page cache.  That's bad.

So, we need to make sure that while the IO is in progress for a true
raw IO, we keep the page locked during the IO and mark the page not-
uptodate after the IO completes.  That way, even if another process
looks up the page while the IO is in progress, the data that was
undergoing IO will be kept private, and we get a second chance once
the IO has completed to see that the page is now shared and we have
to do a copy.

That is slightly messy, but it nicely hides the transient mmap while
still preserving zero-copy for all of the important cases.  It's 
about the cleanest solution I can see which preserves complete
cache coherency at all times, because it guarantees that the IO is
always done inside the page cache itself.

 It does not seem hard, just wondering
 what you had in mind. Also, I presume, if the page was already present
 and up to date then on a read you would not refill it from disk - since it
 may be more recent that the on disk data, existing buffer heads would
 give you this information. 

It's not a physical buffer_head lookup, it's a logical page cache 
lookup that we would do, but yes, we'd read from the page cache in 
this case and just do a copy.

  One consequence is that O_DIRECT I/O from a file which is already cached
  will always result in copies, but I don't mind that too much.
 
 So maybe an O_CLEANCACHE (or something similar) could be used to indicate
 that anything which is found cached should be moved out of the way (flushed
 to disk or tossed depending on what is happening).

That's an orthogonal issue: posix_fadvise() should be the mechanism for
that if the application truly wants to do explicit cache eviction.

--Stephen




Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-17 Thread Stephen C. Tweedie

Hi,

On Fri, Apr 14, 2000 at 06:15:09PM +1000, Andrew Clausen wrote:
 
 Any comments?

Yes!

 Date: Fri, 14 Apr 2000 08:10:10 -0400
 From: Paul Barton-Davis [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: [linux-audio-dev] info point on linux hdr
 
 i mentioned in some remarks to benno how important i thought it was to
 preallocate the files used for hard disk recording under linux.

Preallocation will make little difference.  The real issue is that the
buffer cache is doing write-behind, ie. it is batching up the writes into
big chunks which get blasted to disk once every five seconds or so,
causing large IO request queues to accumulate when that happens.

That's great for normal use because it means that trickles of write
activity don't tie up the spindles the whole time, but it's not ideal
for audio recording.

Try opening the file with open(O_SYNC), and write the data in 128k chunks
or so.  Alternatively call fsync() every so often to force the data to 
disk (though fsync is not particularly efficient on large files with
2.2).  

The down-side of O_SYNC is that the application blocks until all the
data is on disk, but the benefit is that the IO queue size is much more
controlled.  If you have a dedicated thread doing the disk IO, then the
latency of blocking while the IO happens is irrelevant.

--Stephen



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-17 Thread Stephen C. Tweedie

Hi,

On Mon, Apr 17, 2000 at 04:50:05PM +0200, Benno Senoner wrote:
 
 Stephen, I tried all possible combinations , in my hdrbench code.
...

 I tried:
 -fsync()  on all write descriptors at regular intervals ranging from 1sec to
 10sec
 - fdatasync() on all write descriptors , same as above
 - sync()
 - opening output files with O_SYNC  
 
 NO LUCK AT ALL.  :-(

Of course not.  You are doing sync IO in each case, and are doing it
from a single thread.

 Again I think O_DIRECT is the only way which will allow us to use smaller
 buffers, but we will gain almost nothing in terms of total delivered MB/secs.

No.  O_DIRECT will gain you *nothing* in terms of IO scheduling over O_SYNC.
The only thing O_DIRECT gains is zero-copy and zero-caching, which results
in lower CPU utilisation.

The only way you can get much better is to do non-writeback IO 
asynchronously.  Use O_SYNC for writes, and submit the IOs from multiple
threads, to let the kernel schedule the multiple IOs.  Use large block
sizes for each IO to prevent massive amounts of disk seeking.  O_DIRECT
in this case is not an instant win: it is completely orthogonal to the
IO scheduling issue.

--Stephen



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-17 Thread Stephen C. Tweedie

Hi,

On Mon, Apr 17, 2000 at 07:21:31PM +0200, Benno Senoner wrote:

  The only way you can get much better is to do non-writeback IO 
  asynchronously.  Use O_SYNC for writes, and submit the IOs from multiple
  threads, to let the kernel schedule the multiple IOs.  Use large block
  sizes for each IO to prevent massive amounts of disk seeking.  O_DIRECT
  in this case is not an instant win: it is completely orthogonal to the
  IO scheduling issue.
 
 Are you suggesting to fire up multiple threads where each writes a couple of
 files (in 256kb chunks) with  O_SYNC ?

That sort of thing, yes.

 how many threads in you opinion ?

Good question --- somebody really needs to benchmark it.  At least one per
file, though.
 
 The reading thread: should that still be only one, in order to prevent seeking ?

Maybe.  There are competing pressures.  You don't want the mapping 
information in the files to cause extra seeks, so there is some compromise
involved.  I'd guess at least two threads per file, but you _really_ 
need to get it profiled.  What is really needed here is a more efficient
way of encoding the files on disk using fewer indirection blocks: it's
likely to be indirection seeks as much as data seeks which cost here.

 Anyway I don't understand particularly well why multiple writers should buy us
 anything, since the writing thread basically does nothing but issuing write()s
 to disk. That's manual scheduling of write()s in user space (in 256k blocks).

Because if you have got enough IO requests sent to the kernel that you
can present all your outstanding writes at once, then the kernel can
sort out an efficient way of writing them one at a time with minimal seeks.

 Since the thread runs SCHED_FIFO (or nice(-20) ), it should be always 
 ready to issue new write() requests as soon as the disk finished the previous
 one.
 
 Or am I missing something ?

Pipelining.  With O_SYNC (or O_DIRECT), you have a latency between the 
completion of one command and the application sending the next command.
With the data all sent into the kernel, we can actually queue multiple
commands to the disk at once.  It's a lot more efficient.
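
A minimal sketch of that approach: one O_SYNC writer thread per file, each
submitting large chunks so several writes are always queued for the kernel
to order.  The file names, sizes and zero-filled buffers are placeholders
for real track data.

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK   (256 * 1024)
#define NFILES  4
#define NCHUNKS 64

static void *writer(void *arg)
{
        const char *path = arg;
        char *buf = calloc(1, CHUNK);
        int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
        int i;

        if (fd < 0 || !buf)
                return NULL;
        for (i = 0; i < NCHUNKS; i++)           /* 256k synchronous chunks */
                if (write(fd, buf, CHUNK) != CHUNK)
                        break;
        close(fd);
        free(buf);
        return NULL;
}

int main(void)
{
        pthread_t tid[NFILES];
        char *names[NFILES] = { "trk0.dat", "trk1.dat", "trk2.dat", "trk3.dat" };
        int i;

        for (i = 0; i < NFILES; i++)            /* one writer thread per file */
                pthread_create(&tid[i], NULL, writer, names[i]);
        for (i = 0; i < NFILES; i++)
                pthread_join(tid[i], NULL);
        return 0;
}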

 Anyway I think that O_SYNC in a single threaded model is very slow

Yes, it will be!

 performance. Hopefully this is not true for O_DIRECT too, or it will be
 unusable. (with O_SYNC I get a mere 50-60% of the # of tracks compared to running without
 O_SYNC). 

Of course it will be the same for O_DIRECT.  How can it be any different?
It's the same IOs being scheduled.

--Stephen



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-17 Thread Stephen C. Tweedie

Hi,

On Mon, Apr 17, 2000 at 01:45:15PM -0400, Paul Barton-Davis wrote:
  2) Why am I not having any of these problems ? Unlike Benno's code, I
 Seagate 4.5GB Cheetah U2W 10K rpm  IBM 9GB UltraStar U2W 10K rpm
 Quantum 4.5GB Viking U2W 7.5K rpm  3 x IBM 18GB UltraStar
 
 Ahh --- SCSI.  The request queuing for SCSI is very different to 
 that for non-SCSI devices.  
 
 Different enough that you feel its likely to explain the significant
 difference between my experience and that of both Benno and Juhana
 when trying to record to disk ?

Potentially, yes.

 Is it different enough to explain why
 the buffer cache write-behind batching doesn't seem to show up as a
 problem for me ?

Again, yes.  It's definitely worth trying the elevator fix.

 Stephen - thanks for paying attention and giving us time on this. I
 know you have a lot to work on, and that HDR is not what most people
 consider to be a hot Linux application;

No problem.  Actually, many of the problems you are encountering
have parallels in high-end server systems too.  This is definitely
an area where there is a lot of potential for interesting optimisations.

--Stephen



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-17 Thread Stephen C. Tweedie

Hi,

On Mon, Apr 17, 2000 at 01:05:12PM -0400, Paul Barton-Davis wrote:
 
 2) Why am I not having any of these problems ? Unlike Benno's code, I
Seagate 4.5GB Cheetah U2W 10K rpm IBM 9GB UltraStar U2W 10K rpm
Quantum 4.5GB Viking U2W 7.5K rpm 3 x IBM 18GB UltraStar

Ahh --- SCSI.  The request queuing for SCSI is very different to 
that for non-SCSI devices.  The 2.3 kernel includes a new scheduling
algorithm that can significantly improve performance on non-SCSI
block devices where you have large streaming IOs going on.

Could people seeing bad performance on 2.2/IDE please try out the
elevator diffs from Andrea Arcangeli?  Find them at

   
ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.2/2.2.15pre17a/elevator-starvation-8.gz

--Stephen



Re: Ext2 / VFS projects

2000-02-11 Thread Stephen C. Tweedie

Hi,

On Thu, 10 Feb 2000 10:27:29 -0500 (EST), Alexander Viro
[EMAIL PROTECTED] said:

 Correct, but that's going to make design much more complex - you really
 don't want to do it for anything other than sub-page stuff (probably even
 sub-sector). Which leads to 3 levels - allocation block/IO block/sub-sector
 fragment. Not to mention the fact that for cases when you have 1K
 fragments and really large blocks you don't want all this mess around...
 It's doable, indeed, but...

Sure, but to me the main question is this --- can we do this sort of
fragment support in ext3 without having to add complexity to the rest of
the VM/VFS?  I think the answer is yes.

--Stephen



Re: Ext2 / VFS projects

2000-02-10 Thread Stephen C. Tweedie

Hi,

On Wed, 9 Feb 2000 14:30:13 -0500 (EST), Alexander Viro
[EMAIL PROTECTED] said:

 On Wed, 9 Feb 2000 [EMAIL PROTECTED] wrote:

 with 2k blocks and 128 byte fragments, we get to really reduce wasted
 space below any other system i've ever experienced.

 Erm... I'm afraid that you are missing the point. You will get the
 hardware sectors shared between the files. And you can't pass requess
 smaller than that. _And_ you have to lock the bh when you do IO. Now,
 estimate the fun with deadlocks...

That shouldn't matter.  In the new VM it would be pretty trivial for the
filesystem to reserve a separate address_space against which to cache
fragment blocks.  Populating that address_space when we want to read a
fragment block doesn't have to be any more complex than populating the
page cache already is.  IO itself shouldn't be hard.

Yes, this will end up double-caching fragmented files to some extent,
since we'll have to reserve a separate, non-physically-mapped page for
the tail of a fragmented file.

Allocation/deallocation of fragments themselves obviously has to be done
very carefully, but we already have to deal with that sort of race in
the filesystem for normal allocations --- this isn't really any
different in principle.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?

2000-01-16 Thread Stephen C. Tweedie

Hi,

Benno Senoner writes:

  wow, really good idea to journal to a RAID1 array !
  
  do you think it is possible to do the following:
  
  - N disks holding a soft RAID5  array.
  - reserve a small partition on at least 2 disks of the array to hold a RAID1
  array.
  - keep the journal on this partition.

Yes.  My jfs code will eventually support this.  The main thing it is
missing right now is the ability to journal multiple devices to a
single journal: the on-disk structure is already designed with that in
mind but the code does not yet support it.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?

2000-01-16 Thread Stephen C. Tweedie

Hi,

Chris Wedgwood writes:

   This may affect data which was not being written at the time of the
   crash.  Only raid 5 is affected.
  
  Long term -- if you journal to something outside the RAID5 array (ie.
  to raid-1 protected log disks) then you should be safe against this
  type of failure?

Indeed.  The jfs journaling layer in ext3 is a completely generic
block device journaling layer which could be used for such a purpose
(and raid/LVM journaling is one of the reasons it was designed this
way).

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?

2000-01-13 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 22:09:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 Sorry for my ignorance I got a little confused by this post:

 Ingo said we are 100% journal-safe, you said the contrary,

Raid resync is safe in the presence of journaling.  Journaling is not
safe in the presence of raid resync.

 can you or Ingo please explain us in which situation (power-loss)
 running linux-raid+ journaled FS we risk a corrupted filesystem ?

Please read my previous reply on the subject (the one that started off
with "I'm tired of answering the same question a million times so here's
a definitive answer").  Basically, there will always be a small risk of
data loss if power-down is accompanied by loss of a disk (it's a
double-failure); and the current implementation of raid resync means
that journaling will be broken by the raid1 or raid5 resync code after a
reboot on a journaled filesystem (ext3 is likely to panic, reiserfs will
not but will still get its IO ordering requirements messed up by the
resync). 

 After the reboot if all disk remain intact physically, will we only
 lose the data that was being written, or is there a possibility to end
 up in a corrupted filesystem which could more damages in future ?

In the power+disk failure case, there is a very narrow window in which
parity may be incorrect, so loss of the disk may result in inability to
correctly restore the lost data.  This may affect data which was not
being written at the time of the crash.  Only raid 5 is affected.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 16:41:55 -0600, "Mark Ferrell"
[EMAIL PROTECTED] said:

   Perhaps I am confused.  How is it that a power outage while attached
 to the UPS becomes "unpredictable"?  

One of the most common ways to get an outage while on a UPS is somebody
tripping over, or otherwise removing, the cable between the UPS and the
computer.  How exactly is that predictable?

Just because you reduce the risk of unexpected power outage doesn't mean
we can ignore the possibility.

--Stephen



[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

This is a FAQ: I've answered it several times, but in different places,
so here's a definitive answer which will be my last one: future
questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 then raid can miscalculate parity by assuming that the buffer matches
 what is on disk, and that can actually cause damage to other data
 than the data being written if a disk dies and we have to start using
 parity for that stripe.

 do you know if using soft RAID5 + regular etx2 causes the same sort of
 damages, or if the corruption chances are lower when using a non
 journaled FS ?

Sort of.  See below.

 is the potential corruption caused by the RAID layer or by the FS
 layer ?  ( does need the FS code or the RAID code to be fixed ?)

It is caused by neither: it is an interaction effect.

 if it's caused by the FS layer, how does behave XFS (not here yet ;-)
 ) or ReiserFS in this case ?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering.  There is no policy to guide what gets
written and when: the writeback caching can trickle to disk at any time,
and other system components such as filesystems and the VM can force a
write-back of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in
the buffer cache *MUST NOT* be written to disk unless the filesystem
explicitly says so.

RAID-5 needs to interact directly with the buffer cache in order to be
able to improve performance.

There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data
   in a stripe gets written to disk at once.  For RAID-5, this is very
   much faster than dribbling the stripe back one disk at a time.
   Unfortunately, this can result in dirty buffers being written to disk
   earlier than the filesystem expected, with the result that on a
   crash, the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
   order to calculate parity without reading all of the disks in a
   stripe.  If a journaling system tries to prevent modified data from
   being flushed to disk by deferring the setting of the buffer dirty
   flag, then RAID-5 will think that the buffer, being clean, matches
   the state of the disk and so it will calculate parity which doesn't
   actually match what is on disk.  If we crash and one disk fails on
   reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have an incremented b_count reference count, and
making sure that the filesystems all hold that count raised when the
buffers are in an inconsistent or pinned state.

3) The soft-raid background rebuild code reads and writes through the
   buffer cache with no synchronisation at all with other fs activity.
   After a crash, this background rebuild code will kill the
   write-ordering attempts of any journalling filesystem.  

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's
still not that hard to do.
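
A toy user-space model of the b_count rule that fixes interactions 1) and
2) above (the types are simplified stand-ins, not the real buffer_head or
raid structures):

#include <stddef.h>

struct toy_buffer {
        int b_count;            /* raised while the fs has the buffer pinned */
        int b_dirty;
        void (*write_out)(struct toy_buffer *);
};

/* Batch a stripe write, but skip any buffer whose reference count is
 * raised: the filesystem holds b_count while the buffer is in an
 * inconsistent or pinned state. */
static void flush_stripe(struct toy_buffer **stripe, size_t nr)
{
        size_t i;

        for (i = 0; i < nr; i++) {
                struct toy_buffer *bh = stripe[i];

                if (!bh || !bh->b_dirty)
                        continue;
                if (bh->b_count > 0)    /* pinned: leave it for the fs */
                        continue;
                bh->write_out(bh);
                bh->b_dirty = 0;
        }
}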


So, can any of these problems affect other, non-journaled filesystems
too?  Yes, 1) can: throughout the kernel there are places where buffers
are modified before the dirty bits are set.  In such places we will
always mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existent on
non-SMP machines), and the window in which it will persist on disk is
also very small.

This is not a problem.  It is just another example of a race window
which exists already with _all_ non-battery-backed RAID-5 systems (both
software and hardware): even with perfect parity calculations, it is
simply impossible to guarantee that an entire stripe update on RAID-5
completes in a single, atomic operation.  If you write a single data
block and its parity block to the RAID array, then on an unexpected
reboot you will always have some risk that the parity will have been
written, but not the data.  On a reboot, if you lose a disk then you can
reconstruct it incorrectly due to the bogus parity.

THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
only way you can get bitten by this failure mode is to have a system
failure and a disk failure at the same time.


--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 20:17:22 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 Assume all RAID code - FS interaction problems get fixed, since a
 linux soft-RAID5 box has no battery backup, does this mean that we
 will lose data ONLY if there is a power failure AND successive disk
 failure ?  If we lose the power and then after reboot all disks
 remain intact can the RAID layer reconstruct all information in a safe
 way ?

Yes.

--Stephen



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-07 Thread Stephen C. Tweedie

Hi,

On Fri, 07 Jan 2000 00:32:48 +0300, Hans Reiser [EMAIL PROTECTED] said:

 Andrea Arcangeli wrote:
 BTW, I thought Hans was talking about places that can't sleep (because of
 some not schedule-aware lock) when he said "place that cannot call
 balance_dirty()".

 You were correct.  I think Stephen and I are missing in communicating here.

Fine, I was just looking at it from the VFS point of view, not the
specific filesystem.  In the worst case, a filesystem can always simply
defer marking the buffer as dirty until after the locking window has
passed, so there's obviously no fundamental problem with having a
blocking mark_buffer_dirty.  If we want a non-blocking version too, with
the requirement that the filesystem then do a manual rebalance once it
is safe to do so, that will work fine too.

--Stephen



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

2000-01-06 Thread Stephen C. Tweedie

Hi,

On Thu, 23 Dec 1999 02:37:48 +0300, Hans Reiser [EMAIL PROTECTED]
said:

  I completly agree to change mark_buffer_dirty() to call balance_dirty()
  before returning.

 How can we use a mark_buffer_dirty that calls balance_dirty in a
 place where we cannot call balance_dirty?

It shouldn't be impossible: as long as we are protected against
recursive invocations of balance_dirty (which should be easy to
arrange) we should be safe enough, at least if the memory reservation
bits of the VM/fs interaction are working so that the balance_dirty
can guarantee to run to completion.

--Stephen



Re: archive

1999-12-23 Thread Stephen C. Tweedie

Hi,

On Wed, 22 Dec 1999 11:08:37 -0800, "sadri" [EMAIL PROTECTED] said:

 Is there an archive of the emails posted in this list(linux-fsdevel)?
 thanks 

Searching for "linux-fsdevel archive" on www.google.com found several.

--Stephen



Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?

1999-12-21 Thread Stephen C. Tweedie

Hi,

On Tue, 21 Dec 1999 14:57:29 +0100 (CET), Andrea Arcangeli
[EMAIL PROTECTED] said:

 So you are talking about replacing this line:
   dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
 with:
   dirty = (size_buffers_type[BUF_DIRTY]+size_buffers_type[BUF_PINNED]) >> 
PAGE_SHIFT;

Basically yes, but I was envisaging something slightly different from
the above.

There may well be data which is simply not in the buffer cache at all
but which needs to be accounted for as pinned memory.  A good example
would be if some filesystem wants to implement deferred allocation of
disk blocks: the corresponding pages in the page cache obviously cannot
be flushed to disk without generating extra filesystem activity for the
allocation of disk blocks to pages.  The pages must therefore be pinned,
but as they don't yet have disk mappings we can't assume that they are
in the buffer cache.

So we really need a pinned page threshold which can apply to general
pages, not necessarily to the buffer cache.


There's another issue, though.  BUF_DIRTY buffers do not necessarily
count as pinned in this context: they can always be flushed to disk
without generating any significant new memory allocation pressure.  We
still need to do write-throttling, so we need a threshold on dirty data
for that reason.  However, deferred allocation and transactions actually
have a more subtle and nastier property: you cannot necessarily flush
the pages from memory without first allocating more memory.

In the transaction case this is because you have to allow transactions
which are already in progress to complete before you can commit the
transaction (you cannot commit incomplete transactions because that
would defeat the entire point of a transactional system!).  In the case
of deferred disk block allocation, the problem is that flushing the
dirty data requires extra filesystem operations as we allocate disk
blocks to pages.

In these cases we need to be able to make sure that not only does pinned
memory never exceed a threshold, we also have to ensure that the
*future* allocations required to flush the existing allocated memory can
also be satisfied.  We need to allow filesystems to "reserve" such extra
memory, and we need a system-wide threshold on all such reservations.

The ext3 journaling code already has support for reservations, but
that's currently a per-filesystem parameter.  We still have need for a
global VM reservation to prevent memory starvation if multiple different
filesystems have this behaviour.


Note that what we need here isn't complex: it's no more than exporting
atomic_t counts of the number of dirty and reserved pages in the system
and supporting a maximum threshold on these values via /proc.  The
mechanism for observing these limits can be local to each filesystem: as
long as there is an agreed counter in the VM where they can register
their use of memory.
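
A sketch of what that accounting might look like, modelled in user space
with C11 atomics (the names and the threshold value are invented; the real
counters would live in the VM and the threshold would be tuned via /proc):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_long nr_dirty_pages;      /* incremented as pages are dirtied */
static atomic_long nr_reserved_pages;
static long max_pinned_pages = 4096;    /* system-wide threshold, arbitrary */

/* A filesystem reserves the pages it will need in order to flush a
 * future transaction or deferred allocation. */
bool reserve_pages(long nr)
{
        long reserved = atomic_fetch_add(&nr_reserved_pages, nr) + nr;
        long dirty = atomic_load(&nr_dirty_pages);

        if (reserved + dirty > max_pinned_pages) {
                atomic_fetch_sub(&nr_reserved_pages, nr);
                return false;           /* caller must flush/throttle first */
        }
        return true;
}

void release_pages(long nr)
{
        atomic_fetch_sub(&nr_reserved_pages, nr);
}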

--Stephen



RFC: Re: journal ports for 2.3?

1999-12-20 Thread Stephen C. Tweedie

Hi,

All comments welcome: this is a first draft outline of what I _think_
Linus is asking for from journaling for mainline kernels.

On Wed, 15 Dec 1999 13:45:22 -0500, Chris Mason
[EMAIL PROTECTED] said:

 What is your current plan for porting ext3 into 2.3/2.4?  Are you still
 going to be buffer cache based, or do you plan on moving every thing into
 the page cache?

For 2.4 the first release will probably still be in the buffer cache,
but I'm resigned to the fact that Linus won't accept it for a final
merge until it uses an alternative method.

I'd like to talk to you about that if possible.  Right now, it looks
as if the following is the absolute minimum required to make ext3,
reiserfs and any unknown future journaled fs'es work properly in 2.3:

  * Add an extra "async" parameter to super_operations->write_super()
to distinguish between bdflush and sync() (a sketch follows this list)

  * Clean up the rules for allowing the raid5 code to snoop the buffer
cache: raid5 should consider a buffer locked and transient if it
has b_count raised

  * The raid resync code needs to be atomic wrt. ll_rw_block()

  * Whatever caching mechanism we use --- page cache or something else
--- we *must* allow the VM to make callbacks into the filesystem
to indicate memory pressure.  There are two cases: first, when
memory gets short, we need to be able to request flush-from-memory
(including clean pages); secondly, if we detect too many dirty
buffers, we need to be able to request flush-to-disk (without
necessarily reclaiming memory, but causing a stall on the calling
process to act as a throttle on heavy write traffic).
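
For the first item, the intended shape of the change is roughly this (a
sketch only, not a patch against any particular tree):

  /* Sketch of the first item above -- not a finished patch. */
  struct super_block;

  struct super_operations {
          /* ...existing methods elided... */

          /*
           * "async" would distinguish a routine bdflush writeback
           * (async=1) from an explicit sync() (async=0), so a journaled
           * fs can defer or batch the former without breaking the latter.
           */
          void (*write_super)(struct super_block *sb, int async);
  };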

For the out-of-memory pressure, ideally all we need is a callback on
the page->mapping address_space.  We have one address space per
inode, so adding a struct as_operations to the address_space would
only grow our tables by one pointer per inode, not one pointer per
page.

Shrink_mmap() can easily use such a pointer to perform any
filesystem-specific tearing-down of the page.


The second case is a little more tricky: currently the only
mechanism we have for write throttling under heavy write load is the
refile_buffer() checks in buffer.c.  Ideally there should be a
system-wide upper bound on dirty data: if each different filesystem
starts to throttle writes at 50% of physical memory then you only
need two different filesystems to overcommit your memory badly.

A PG_Dirty flag, a global counter of dirty pages and a system-wide
dirty memory threshold would be enough to allow ext3 and reiserfs to
perform their own write throttling in a way which wouldn't fall
apart if both ext3 and reiserfs were present in the system at the
same time.  Making the refile_buffer() checks honour that global
threshold would be trivial.  
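
As an illustration of how small the mechanism is (the function names
below are invented; the flag and counter are the ones proposed above):

  /* Sketch: invented names, not real kernel symbols. */
  extern atomic_t nr_dirty_pages;        /* incremented when PG_Dirty is set */
  extern int      dirty_page_threshold;  /* system-wide limit, set via /proc */

  /*
   * Called by a journaling fs (or from the refile_buffer() path) after it
   * has dirtied pages: if the whole system is over the limit, push some
   * dirty pages towards the disk and stall the writer a little.
   */
  void throttle_dirty_pages(void)
  {
          while (atomic_read(&nr_dirty_pages) > dirty_page_threshold) {
                  fs_flush_some_dirty_pages();   /* hypothetical flush hook */
                  schedule();                    /* act as a throttle */
          }
  }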

The PG_Dirty flag would also allow for VM callbacks to be made to
the filesystems if it was determined that we needed the dirty memory
pages for some other use (as already happens in the buffer cache if
try_to_free_buffers fails and wakes up bdflush).  Such a callback
should also be triggered off the address_space.

There are lots of other things which would be useful to journaling, such
as the ll_rw_block-level write ordering enforcement and write barrier,
but the above is really the minimum necessary to actually get the things
to _work_ without intruding into the buffer cache and without destroying
the system's performance if journaled transactions are allowed to grow
without VM back-pressure.


Cheers,
 Stephen



Re: Oops with ext3 journaling

1999-12-08 Thread Stephen C. Tweedie

Hi,

On Wed, 8 Dec 1999 17:28:49 -0500, "Theodore Y. Ts'o" [EMAIL PROTECTED]
said:

 Never fear, there will be an very easy way to switch back and forth
 between ext2 and ext3.  A single mount command, or at most a single
 tune2fs command, should be all that it takes, no matter how the
 journal is stored.

Absolutely.  I am 100% committed to this.  Apart from anything else,
this is the mechanism which will allow for incompatible revisions of
the ext3 journal format: you will always be able to mount a journaled
ext3 partition as ext2, and then remount as the new ext3, if you want
to upgrade ext3 partitions to a new, incompatible format (which should
not happen after final release, but there will be at least one such
incompatible format revision required in the next month or so).

Any such format changes will be limited to the journal format: there
will be no journaling changes which prevent backing the filesystem
revision back down to ext2.  Ever.

--Stephen



Re: Oops with ext3 journaling

1999-12-06 Thread Stephen C. Tweedie

Hi,

On Sat, 4 Dec 1999 08:44:46 -0800 (PST), Brion Vibber
[EMAIL PROTECTED] said:

 Maybe at least stick a nice big warning in the docs along the lines of
 "do not write to your journal file while mounted with journaling on,
 you big dummy!" :) Not that I'd do so deliberately of course, but it
 might make people a little more wary later on. The README does
 recommend setting the permissions to 400, but that doesn't protect
 from root of course.

"chattr +i journal.dat" will protect even from root.

 Is there any reason you _would_ want to be able to write to the journal
 file from userland while mounted? I'd guess no...

No, and I'm pretty much convinced now that I'll move to having a
private, hidden inode for the journal in the future.

--Stephen



Re: Oops with ext3 journaling

1999-12-06 Thread Stephen C. Tweedie

Hi,

On Sat, 4 Dec 1999 12:11:58 -0700, mike burrell [EMAIL PROTECTED] said:

 couldn't you just make a new flag for the inode that journal.dat uses?  i'm
 guessing using S_IMMUTABLE will cause some problems, but something similar
 to that?

The immutable flag will work fine: journaling bypasses the normal file
write mechanisms, so it will work with an "immutable" journal quite
safely.

--Stephen



Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Stephen C. Tweedie

Hi,

On Wed, 3 Nov 1999 10:30:36 +0100 (MET), Ingo Molnar
[EMAIL PROTECTED] said:

 OK... but raid resync _will_ block forever as it currently stands.

 {not forever, but until the transaction is committed. (it's not even
 necessary for the RAID resync to wait for locked buffers, it could as well
 skip over locked & dirty buffers - those will be written out anyway.) }

No --- it may block forever.  If I have a piece of busy metadata (eg. a
superblock) which is constantly being updated, then it is quite possible
that the filesystem will modify and re-pin the buffer in a new
transaction before the previous transaction has finished committing.
(The commit happens asynchronously after the transaction has closed, for
obvious performance reasons, and a new transaction can legitimately
touch the buffer while the old one is committing.)

 i'd like to repeat that i do not mind what the mechanizm is called,
 buffer-cache or physical-cache or pagecache-II or whatever. The fact that
 the number of 'out of the blue sky' caches and IO is growing is a _design
 bug_, i believe.

Not a helpful attitude --- I could just as legitimately say that the
current raid design is a massive design bug because it violates the
device driver layering so badly.

There are perfectly good reasons for doing cache-bypass.  You want to
forbid zero-copy block device IO in the IO stack just because it is
inconvenient for software raid?  That's not a good answer.

Journaling has a good example: I create temporary buffer_heads in
order to copy journaled metadata buffers to the log without copying
the data.  It is simply not possible to do zero-copy journaling if I
go through the cache, because you'd be wanting me to put the same
buffer_head on two different hash lists for that.  Ugh.  

 But whenever a given page is known to be 'bound' to a given physical block
 on a disk/device and is representative for the content, 

*BUT IT ISN'T* representative of the physical content.  It only
becomes representative of that content once the buffer has been
written to disk.

Ingo, you simply cannot assume this on 2.3.  Think about memory mapped
shared writable files.  Even if you hash those pages' buffer_heads
into the buffer cache, those writable buffers are going to be (a)
volatile, and (b) *completely* out of sync with what is on disk ---
and the buffer_heads in question are not going to be marked dirty.

You can get around this problem by snapshotting the buffer cache and
writing it to the disk, of course, but if you're going to write the
whole stripe that way then you are forcing extra IOs anyway (you are
saving read IOs for parity calcs but you are having to perform extra
writes), and you are also going to violate any write ordering
constraints being imposed by higher levels.

 it's a partial cache i agree, but in 2.2 it is a good and valid way to
 ensure data coherency. (except in the swapping-to-a-swap-device case,
 which is a bug in the RAID code)

... and --- now --- except for raw IO, and except for journaling.

 sure it can. In 2.2 the defined thing that prevents dirty blocks from
 being written out arbitrarily (by bdflush) is the buffer lock. 

Wrong semantics --- the buffer lock is supposed to synchronise actual
physical IO (ie. ll_rw_block) and temporary states of the buffer.  It
is not intended to have any relevance to bdflush.

 I'll not pretend that this doesn't pose difficulties for raid, but
 neither do I believe that raid should have the right to be a cache
 manager, deciding on its own when to flush stuff to disk.

 this is a misunderstanding! RAID will not and does not flush anything to
 disk that is illegal to flush.

raid resync currently does so by writing back buffers which are not
marked dirty.

 There is a huge off-list discussion in progress with Linus about this
 right now, looking at the layering required to add IO ordering.  We have
 a proposal for per-device IO barriers, for example.  If raid ever thinks
 it can write to disk without a specific ll_rw_block() request, we are
 lost.  Sorry.  You _must_ observe write ordering.

 it _WILL_ listen to any defined rule. It will however not be able to go
 telepathic and guess any future interfaces.

There is already a defined rule, and there always has been.  It is
called ll_rw_block().  Anything else is a problem.

 I'd like to have ways to access mapped & valid cached data from the
 physical index side.

You can't.  You have never been able to assume that safely.

Think about ext2 writing to a file in any kernel up to 2.3.xx.  We do
the following:

getblk()
ll_rw_block(READ) if it is a partial write
copy_from_user()
mark_buffer_dirty()
update_vm_cache()

copy_from_user has always been able to block.  Do you see the problem?
We have wide-open windows in which the contents of the buffer cache have
been modified but the buffer is not marked dirty.  With 2.3, things just
get worse for you, with writable shared mappings dirtying things

Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 1 Nov 1999 13:04:23 -0500 (EST), Ingo Molnar [EMAIL PROTECTED]
said:

 On Mon, 1 Nov 1999, Stephen C. Tweedie wrote:
 No, that's completely inappropriate: locking the buffer indefinitely
 will simply cause jobs like dump() to block forever, for example.

 i dont think dump should block. dump(8) is using the raw block device to
 read fs data, which in turn uses the buffer-cache to get to the cached
 state of device blocks. Nothing blocks there, i've just re-checked
 fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
 anything.

fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
by a wait_on_buffer().  It blocks.

 (the IO layer should and does synchronize on the bh lock) 

Exactly, and the lock flag should be used to synchronise IO, _not_ to
play games with bdflush/writeback.  If we keep buffers locked, then raid
resync is going to stall there too for the same reason ---
wait_on_buffer() will block.

 However, you're missing a much more important issue: not all writes go
 through the buffer cache.

 Currently, swapping bypasses the buffer cache entirely: writes from swap
 go via temporary buffer_heads to ll_rw_block.  The buffer_heads are

 we were not talking about swapping but journalled transactions, and you
 were asking about a mechanizm to keep the RAID resync from writing back to
 disk.

It's the same issue.  If you arbitrarily write back through the buffer
cache while a swap write IO is in progress, you can wipe out that swap
data and corrupt the swap file.  If you arbitrarily write back journaled
buffers before journaling asks you to, you destroy recovery.  The swap
case is, if anything, even worse: it kills you even if you don't take a
reboot, because you have just overwritten the swapped-out data with the
previous contents of the buffer cache, so you've lost a write to disk.

Journaling does the same thing by using temporary buffer heads to write
metadata to the log without copying the buffer contents.  Again it is IO
which is not in the buffer cache.

There are thus two problems: (a) the raid code is writing back data
from the buffer cache oblivious to the fact that other users of the
device may be writing back data which is not in the buffer cache at all,
and (b) it is writing back data when it was not asked to do so,
destroying write ordering.  Both of these violate the definition of a
device driver.

 The RAID layer resync thread explicitly synchronizes on locked
 buffers. (it doesnt have to but it does) 

And that is illegal, because it assumes that everybody else is using the
buffer cache.  That is not the case, and it is even less the case in
2.3.

 You suggested a new mechanizm to mark buffers as 'pinned', 

That is only to synchronise with bdflush: I'd like to be able to
distinguish between buffers which contain dirty data but which are not
yet ready for disk IO, and buffers which I want to send to the disk.
The device drivers themselves should never ever have to worry about
those buffers: ll_rw_block() is the defined interface for device
drivers, NOT the buffer cache.
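
To illustrate the distinction (the bit and helper names below are
invented; none of this is existing kernel code), the writeback side
needs little more than:

  /* Hypothetical only: there is no BH_Pinned bit in the stock kernel. */
  #define BH_Pinned 5                       /* illustrative bit number */
  #define buffer_pinned(bh) (test_bit(BH_Pinned, &(bh)->b_state))

  /*
   * bdflush-style writeback sketch: dirty buffers that the filesystem
   * has pinned are skipped; they only reach the driver when the fs
   * itself hands them to ll_rw_block() at commit time.
   */
  static void flush_dirty_buffers_sketch(struct buffer_head *list)
  {
          struct buffer_head *bh;

          for (bh = list; bh != NULL; bh = bh->b_next_free) {
                  if (!buffer_dirty(bh) || buffer_pinned(bh))
                          continue;
                  ll_rw_block(WRITE, 1, &bh);
          }
  }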

 In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
 buffer cache. [...]

 the RAID code has major problems with 2.3's pagecache changes. 

It will have major problems with ext3 too, then, but I really do think
that is raid's fault, because:

 2.3 removes physical indexing of cached blocks, 

2.2 never guaranteed that IO was from cached blocks in the first place.
Swap and paging both bypass the buffer cache entirely.  To assume that
you can synchronise IO by doing a getblk() and syncing on the
buffer_head is wrong, even if it used to work most of the time.

 and this destroys a fair amount of physical-level optimizations that
 were possible. (eg. RAID5 has to detect cached data within the same
 row, to speed up things and avoid double-buffering. If data is in the
 page cache and not hashed then there is no way RAID5 could detect such
 data.)

But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
a swapon, then the swapper will start to write to that swapfile using
temporary buffer_heads.  If you do IO or checksum optimisation based on
the buffer cache you'll risk plastering obsolete data over the disks.  

 i'll probably try to put pagecache blocks on the physical index again
 (the buffer-cache), which solution i expect will face some resistance
 :)

Yes.  Device drivers should stay below ll_rw_block() and not make any
assumptions about the buffer cache.  Linus is _really_ determined not to
let any new assumptions about the buffer cache into the kernel (I'm
having to deal with this in the journaling filesystem too).

 in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
 The buffer-cache represents all cached (dirty and clean) blocks within the
 system. 

It does not, however, represent any non-cached IO.

 If there are other block caches in the system (the page-cache in 2.2
 was readonly, thus no

Re: wierdisms w/ ext3.

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 1 Nov 1999 15:03:54 -0600, Timothy Ball
[EMAIL PROTECTED] said:

 I did my best to try to follow what the README for ext3 said. I made a
 journal file in /var/local/journal/journal.dat. It has an inode # of
 183669.

 Then I did /sbin/lilo -R linux rw rootflags=journal=183669. 

Silly question, but is /var/local/journal on the same filesystem as the
root?  Those rootflags look fine otherwise.

--Stephen



Re: wierdisms w/ ext3.

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Tue, 2 Nov 1999 03:10:10 -0600, Timothy Ball
[EMAIL PROTECTED] said:

 Here's the info from /var/log/dmesg. Could it be that my journal file
 has a large inode number? And if you have more than one ext3 partition
 can you have more than one journal file? How would you specify it...
 must read code... 

You need one per filesystem, and you register it when you mount the
filesystem.  For non-root filesystems, just umount and remount with the
"-o journal=xxx" flag.  For the root filesystem you need the rootflags=
trick.

--Stephen



Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 01 Nov 1999 15:53:29 -0500, Jeff Garzik
[EMAIL PROTECTED] said:

 XFS delays allocation of user data blocks when possible to
 make blocks more contiguous; holding them in the buffer cache.
 This allows XFS to make extents large without requiring the user
 to specify extent size, and without requiring a filesystem
 reorganizer to fix the extent sizes after the fact. This also
 reduces the number of writes to disk and extents used for a file. 

 Is this sort of manipulation possible with the existing buffer cache?

Absolutely not, but it is not hard with the page cache.  The main thing
missing is a VM callback to allow memory pressure to force unallocated,
pinned pages to disk.

--Stephen



Re: Buffer and page cache

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Tue, 02 Nov 1999 08:15:36 -0700, [EMAIL PROTECTED] said:

 I'd like these pages to age a little before handing them over to the
 "inode disk", because the "write_one_page" function called by
 generic_file_write would incur significant latency if the inode disk is
 "real", ie. not simulated in the same system.

The write-page method is only required to queue the data for writing to
the media.  It is not required to complete the physical IO, so the
filesystem can use any mechanism it likes to keep those pages queued for
eventual physical IO (just as 2.3 uses the buffer lists to queue that
data for eventual writeback via bdflush).

 So we have a page cache for the inodes in the file system where the
 pages become dirty - but no buffers are attached.  It reminds of a
 shared mapping, but there is no vma for the pages.

Fine.

 What appears to be needed is the following - probably it's mostly
 lacking in my understanding, but I'd appreciate to be advised how to
 attack the following points:

 - a bit to keep shrink_mmap away from the page.  

Yes, bumping the page count is the perfect way to do this.

 - a bit for a struct page that indicates the page needs to be written. 
 From block_write_full_page one could think that the PageUptoDate bit is
 maybe the one to use.  But does that really describe that this page is
 "dirty" - as it is done for buffers.

PageUpToDate can't be used: it is needed to flag whether the contents of
the page are valid for a read.  A written page must always be uptodate:
!uptodate implies that we have created the page but are still reading it
in from disk (or that the readin failed for some reason).

 - some indication of aging: we would like a pgflush daemon to walk the
 dirty pages of the file system and write them back _after_ a little
 while 

The fs should be able to manage that on its own.  If you queue all of
the pages which have been sent to the writepage() method, then you can
flush to the physical disk whenever you want.  A trivial bdflush
lookalike in the fs itself can deal with that.
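
One way to keep such a queue, sketched with invented names (the
filesystem-private pointer suggested below would avoid the separate
allocation, and all locking is omitted):

  /* Invented example: a per-fs queue of pages handed to writepage(),
   * flushed later by the fs's own bdflush lookalike. */
  struct fs_dirty_page {
          struct list_head  list;
          struct page      *page;
          unsigned long     queued_at;     /* jiffies, for simple aging */
  };

  static LIST_HEAD(fs_dirty_pages);

  /* Called from the fs writepage() method: queue the page, do no IO yet. */
  static int fs_queue_dirty_page(struct page *page)
  {
          struct fs_dirty_page *dp = kmalloc(sizeof(*dp), GFP_KERNEL);

          if (!dp)
                  return -ENOMEM;          /* real code must not lose the page */
          atomic_inc(&page->count);        /* keep shrink_mmap away from it */
          dp->page = page;
          dp->queued_at = jiffies;
          list_add(&dp->list, &fs_dirty_pages);
          return 0;
  }

The fs's own flush thread then walks the list and pushes anything old
enough out through the normal block IO path.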

You might well want a filesystem-private pointer in the page struct off
which to hook any fs-specific data (such as your dirty page linked list
pointers and the dirty flag).  You will also need a way for the VM to
exert memory pressure on those pages if it needs to reclaim memory.

These are both things which ext3 will want anyway, so we should make
sure that any infrastructure that gets put in place for this gets
reviewed by all the different fs groups first.

--Stephen



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-01 Thread Stephen C. Tweedie

Hi,

On Fri, 29 Oct 1999 14:06:24 -0400 (EDT), Ingo Molnar [EMAIL PROTECTED]
said:

 On Fri, 29 Oct 1999, Stephen C. Tweedie wrote:

 Fixing this in raid seems far, far preferable to fixing it in the
 filesystems.  The filesystem should be allowed to use the buffer cache
 for metadata and should be able to assume that there is a way to prevent
 those buffers from being written to disk until it is ready.

 why dont you lock the buffer during transaction, thats the right way to
 'pin' down a buffer and prevent it from being written out. You can keep a
 buffer locked & dirty indefinitely, and it should be easy to unlock them
 when committing a transaction. Am i missing something?

No, that's completely inappropriate: locking the buffer indefinitely
will simply cause jobs like dump() to block forever, for example.
However, you're missing a much more important issue: not all writes go
through the buffer cache.

Currently, swapping bypasses the buffer cache entirely: writes from swap
go via temporary buffer_heads to ll_rw_block.  The buffer_heads are
never part of the buffer cache and are discarded as soon as IO is
complete.  The same mechanism is used when reading to the page cache,
but that's probably safe enough as writes do use the buffer cache in
2.2.

In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
buffer cache.  The buffer_heads do persist, but they overlay the page
cache, not the buffer cache --- they do not appear on the buffer cache
hash lists.  You _cannot_ synchronise with these writes at the buffer
cache level.  If your raid resync collides with such a write, it is
entirely possible that the filesystem write will occur between the raid
read and the raid write --- you will corrupt ext2 files.

In my own case right now, ext3 on 2.2 behaves much like ext2 does on 2.3
--- it uses temporary buffer_heads to write directly to the journal, and
so is going to be bitten by the raid resync behaviour.

Basically, device drivers cannot assume that all IOs come from the
buffer cache --- raid has to work at the level of ll_rw_block, not at
the level of the buffer cache.

--Stephen




Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-01 Thread Stephen C. Tweedie

Hi,

On Mon, 01 Nov 1999 15:58:33 -0600, [EMAIL PROTECTED] said:

 I agree with this, it feels closer to the linux page cache, the
 terminology in the XFS white paper is a little confusing here.

 XFS on Irix caches file data in buffers, but not in the regular buffer
 cache, they are cached off the vnode and organized by logical file
 offset rather than by disk block number, 

This is describing the job done by the page cache, not the buffer cache,
in Linux.

 the memory in these buffers comes from the page subsystem, the page
 tag being the vnode and file offset. These buffers do not have to have
 a physical disk block associated with them, XFS allows you to reserve
 blocks on the disk for a file without picking which blocks. At some
 point when the data needs to be written (memory pressure, or sync
 activity etc)

The main thing we'd want to establish to support this fully in Linux is
exactly this --- what VM callbacks do you need here?  Memory pressure is
not currently something that gets fed back adequately to the
filesystems, and I'll be needing similar feedback for ext2 journaling
(we need to be told to do early commits if the memory used by a
transaction is required elsewhere).

The details of the IOs themselves should be capable of being handled
perfectly well within the existing device driver abstraction: you don't
need to use the buffer cache.

--Stephen



Raid resync changes buffer cache semantics --- not good for journaling!

1999-10-29 Thread Stephen C. Tweedie

Hi all,

There seems to be a conflict between journaling filesystem requirements
(both ext3 and reiserfs), and the current raid code when it comes to
write ordering in the buffer cache.

The current ext3 code adds debugging checks to ll_rw_block designed to
detect any cases where blocks are being written to disk in an order
which breaks the filesystem's transaction ordering guarantees.  

A couple of hours ago it was triggered during a test run here by the
raid background resync daemon.

Raid resync basically works by reading, and rewriting, the entire raid
device stripe by stripe.  The write pass is unconditional.  Even if the
block is marked as reserved for journaling, and so is bypassed by
bdflush, even if the block is clean: it gets written to disk.

ext3 uses a separate buffer list for journaled buffers to avoid bdflush
writing them back early.  As I understand it (correct me if I'm wrong,
Chris), reiserfs journaling simply avoids setting the dirty bit on the
buffer_head until the log record has been written.  Neither case stops
raid resync from flushing the buffer to disk.

As far as I can see, the current raid resync simply cannot observe any
write ordering requirements being placed on the buffer cache.  This is
something which will have to be addressed in the raid code --- the only
alternative appears to be to avoid placing any uncommitted transactional
data in the buffer cache at all, which would require massive rewrites of
ext3 (and probably no less trauma in reiserfs).

This isn't a bug in either the raid code or the journaling --- it's just
that the raid code changes semantics which non-journaling filesystems
don't care about.  Journaling adds extra requirements to the buffer
cache, and raid changes the semantics in an incompatible way.  Put the
two together and you have serious problems during a background raid
sync.

Ingo, can we work together to address this?  One solution would be the
ability to mark a buffer_head as "pinned" against being written to disk,
and to have raid resync use a temporary buffer head when updating that
block and use the on-disk copy, not the in-memory one, to update the
disk (guaranteeing that the in-memory copy doesn't hit disk).  You will
have a much better understanding of the locking requirements necessary
to ensure that the two copies don't cause mayhem, but I'm willing to
help on the implementation.  
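
A rough sketch of the check the resync write pass might make, reusing
the hypothetical pin flag; the locking needed between the re-read and
the rewrite is the hard part and is not shown:

  /*
   * Sketch only: buffer_pinned() is a hypothetical flag, and the
   * surrounding resync loop and error handling are omitted.
   */
  static int resync_must_use_disk_copy(kdev_t dev, int block, int size)
  {
          struct buffer_head *bh = get_hash_table(dev, block, size);
          int pinned = 0;

          if (bh) {
                  pinned = buffer_pinned(bh);
                  brelse(bh);
          }
          return pinned;
  }

For blocks that test positive, resync would read the on-disk contents
into private, unhashed buffer_heads and rewrite those, leaving the
pinned in-memory copy untouched.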

Fixing this in raid seems far, far preferable to fixing it in the
filesystems.  The filesystem should be allowed to use the buffer cache
for metadata and should be able to assume that there is a way to prevent
those buffers from being written to disk until it is ready.

--Stephen



Re: [ext3-0.0.2b] no-go with block size 4k

1999-10-29 Thread Stephen C. Tweedie

Hi,

On Thu, 28 Oct 1999 21:29:44 +0200, Marc Mutz [EMAIL PROTECTED] said:

 Hi Stephen!
 I just tried your journalling support with my old spare scsi disk
 (240M). The things I tried were:

 Oct 28 21:08:57 adam kernel: Journal length (768 blocks) too short.

Your journal is too short.  The jfs layer expects to have at least 1024
blocks available, which means it needs a minimum of 4MB of journal with
a 4k blocksize.

This sort of problem will all go away once the user-mode tools for
setting up journals automatically are in place!

--Stephen



Re: ext3 - filesystem is not clean after recovery

1999-10-26 Thread Stephen C. Tweedie

Hi,

On Tue, 26 Oct 1999 10:19:13 +0200, [EMAIL PROTECTED]
(Miklos Szeredi) said:

 Hi,
 Sorry, I forgot to say, that it was with 0.0.2b. Also I reproduced
 this twice, so the second time, it _was_ a clean fs before converting
 to ext3.

Are you sure you applied _both_ 0.0.2a and 0.0.2b, not just 0.0.2b?
This sounds exactly like a problem in truncate on 0.0.2a.  I
absolutely cannot reproduce any recovery failures with both applied.
Alternatively, are you running on software raid?  There's a possible
raid problem which affects ext3 which I'm chasing right now.

--Stephen



Re: ext3 - filesystem is not clean after recovery

1999-10-26 Thread Stephen C. Tweedie

Hi,

On Tue, 26 Oct 1999 14:56:50 +0200, [EMAIL PROTECTED]
(Miklos Szeredi) said:

 I will try to make more tests with a cleaner configuration...

OK, thanks --- the more information you can provide, the better.  A
reliable reproducer for any problems would be best of all.

--Stephen



Re: ext3 - filesystem is not clean after recovery

1999-10-25 Thread Stephen C. Tweedie

Hi,

On Mon, 25 Oct 1999 18:41:09 +0200, [EMAIL PROTECTED]
(Miklos Szeredi) said:

 5) boot, then mount ext3 filesystem - it says:
   JFS DEBUG: (recovery.c, 411): journal_recover: JFS: recovery, exit status 0, 
recovered transactions 130 to 133
 6) unmount the fs, and with debugfs turn off journalling
 7) "e2fsck -f" the filesystem

 And fsck returned some errors. Is this normal?

Which version of ext3?  If you haven't applied the 0.0.2a and 0.0.2b
patches, then yes, there have been bugs fixed which might have caused
this.  Another possible danger is applying a journal to a filesystem
which is not completely clean and which has undetected errors ---
eventually the user mode tools for journal creation will be advanced
enough not to allow creation of a journal on an unclean filesystem.

--Stephen



Re: ext3-0.0.2a patch released

1999-10-20 Thread Stephen C. Tweedie

Hi,

On Tue, 19 Oct 1999 09:50:59 -0400, Daniel Veillard
[EMAIL PROTECTED] said:

   The oops of the day :

 Oct 19 05:42:50 fr kernel: Assertion failure in journal_get_write_access() at 
transaction.c line 436: "handle->h_buffer_credits > 0" 
...
 Oct 19 05:42:50 fr kernel: Call Trace: [cprt+18806/35456] [cprt+19583/35456] 
[ext3_write_inode+197/424] [sync_all_inodes+131/188] [try_to_free_inodes+39/56] 
[grow_inodes+30/364] [get_empty_inode+145/156]  

The problem is that an existing transaction is running before we call
ext3_new_inode(), and that get_empty_inode() has decided that there
are too many dirty inodes and that some need to be flushed to disk.
We're in the middle of an existing transaction, so those flushes get
accounted against the running transaction handle.  Oops.

This should be taken care of in 0.0.2b.  I've put a prerelease of that
up for ftp, and I'll make it official once it has had a bit more
testing.

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-19 Thread Stephen C. Tweedie

Hi,

On 19 Oct 1999 00:44:38 -0500, [EMAIL PROTECTED] (Eric
W. Biederman) said:

 Meanwhile having the metadata in the page cache (where they would
 have predictable offsets by file size) 

Doesn't help --- you still need to look up the physical block numbers
in order to clear the allocation bitmaps for indirect blocks, so we're
going to need those lookups anyway.  Once you have that, looking for a
given fixed offset in the buffer cache is no harder than doing so in
the page cache.

 would also speed handle partial truncates . . .  [ Either case with
 a little gentle handling will handle speed up fsync] (At least for
 ext2 type filesystems).

No argument there, but a per-inode dirty buffer list would do the
same.  Neither of these are overwhelming arguments for a move.

--Stephen



Re: Announce: ext2+journaling, release 0.0.2

1999-10-18 Thread Stephen C. Tweedie

Hi,

On Fri, 15 Oct 1999 14:04:48 +, Peter Rival [EMAIL PROTECTED]
said:

 Well, I think I just uncovered the first, umm, detail for this
 release ;) Got the following while trying to start an AIM VII fserver
 run on an AlphaPC164.  Disks are all 2GB narrow SCSI, hanging off of a
 single chain from a KZPCM (aka symbios 53c875).  Any hints (other than
 "don't run AIM yet" ;)?  After this, the multitask process went into a
 D state and was unkillable.  Help?

 <1>JFS unimplemented function journal_release_buffer

Should be fixed in 0.0.2a (coming out in the next hour or two).

Thanks,
 Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On Sat, 16 Oct 1999 01:59:38 -0400 (EDT), Alexander Viro
[EMAIL PROTECTED] said:

a) to d), fine.

   e) we might get out with just a dirty blocks lists, but I think
 that we can do better than that: keep per-inode cache for metadata. It
 is going to be separate from the data pagecache. It is never exported to
 user space and lives completely in kernel context. Ability to search
 there doesn't mean much for normal filesystems, but weirdies like AFFS
 will _really_ benefit from it - I've just realized that I was crufting up
 an equivalent of such cache there anyway.

Why?  Whenever we are doing a lookup on such data, we are _always_
indexing by physical block number: what's wrong with the normal buffer
cache in such a case?

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On Mon, 18 Oct 1999 14:30:10 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

 I can't see these bigmem issues. The buffer and page-cache memory is not
 in bigmem anyway. And you can use bigmem _wherever_ you want as far as you
 remeber to fix all the involved code to kmap before read/write to
 potential bigmem memory. bigmem issue looks like a red-herring to me.

Data and metadata are completely different.  On, say, a large and busy
web or ftp server, you really don't care about a 1G metadata limit, but
a 1G page cache limit is much more painful.  

Secondly, data is accessed by user code: mmap of highmem page cache
pages can work automatically via normal ptes, and read()/write() of such
pages requires kmap (or rather, something a little more complex which
can accept a page fault mid-copy) in a very few, well-defined places in
filemap.c.  Metadata, on the other hand, is extremely hot data, being
accessed randomly at high frequency _everywhere_ in the filesystems.

It's a much, much messier job to teach the filesystems about high-memory
buffer cache than to teach the filemap about high-memory page cache, and
the page cache is the one under the most memory pressure in the first
place.

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On Mon, 18 Oct 1999 13:26:45 -0400 (EDT), Alexander Viro
[EMAIL PROTECTED] said:

 You can't even know which is the inode Y that is using a block X without
 reading all the inode metadata while the block X still belongs to the
 inode Y (before the truncate).

 WTF would we _need_ to know? Think about it as about memory caching. You
 can cache by virtual address and you can cache by physical address. And
 since we have no aliasing here... 

We still have the underlying problem.  The hash lists used for buffer
cache lookup are only part of the buffer cache data structures.  The
other part --- the buffer lists --- are independent of the namespace you
are using.  If you have a dirty buffer in an inode's metadata cache,
then yes, you still need to deal with the aliasing left when that data
is deallocated unless you explicitly revoke (bforget) that buffer as
soon as it is deallocated.

 Right now you know only that the stale buffer can't came from inode
 metadata as we have the race prone slowww trick to loop in polling mode
 inside ext2_truncate until we'll be sure bforget will run in hard mode.

 And? Don't use bread() on metadata. It should never enter the buffer hash,
 just as buffer_heads of _data_ never enter it. Leave bread() for
 statically allocated stuff.

It's write, not read, which is the problem: the danger is that you have
a dirty metadata buffer which is deleted and reallocated as data, but
the buffer is still dirty in the buffer cache and can stomp on your
freshly allocated data buffer later on.

The cache coherency isn't the problem as much as _disk_ coherency.
Cache coherency in this case doesn't hurt as long as the wrong version
never hits disk (because the first thing that will happen if a stale
metadata buffer gets reallocated as new metadata will be that the buffer
in memory gets cleared anyway).

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On 18 Oct 1999 08:20:51 -0500, [EMAIL PROTECTED] (Eric W. Biederman)
said:

 And I still can't see how you can find the stale buffer in a
 per-object queue as the object can be destroyed as well after the
 lowlevel truncate.

 Yes but you can prevent the buffer from becoming a stale buffer with the
 per-object queue.

That still doesn't let you get rid of the bforget(): remember that a
partial truncate() still needs to be dealt with, and you can't work out
which buffers to discard in that case without doing the full metadata
walk.  Just having a per-inode buffer queue doesn't help there (although
it _would_ help solve the case which we currently get wrong, which is
directory data blocks, since we never partially truncate a directory).

--Stephen



Re: journal requirements for buffer.c

1999-10-14 Thread Stephen C. Tweedie

Hi,

On Wed, 13 Oct 1999 02:19:19 +0400, Hans Reiser [EMAIL PROTECTED] said:

 I merely hypothesize that the maximum value of required
 FLUSHTIME_NON_EXPANDING will usually be less than 1% of memory, and
 therefor won't have an impact.  It is not like keeping 1% of memory
 around for use by text segments and other FLUSHTIME_NON_EXPANDING
 buffers is likely to be a bad thing.

That's probably enough for journaled filesystems, but with deferred
allocation it definitely is not.  If you have a lot of data to commit,
then I guess that the tree operations required to push many tens of MB
of data to disk could well exceed that 1%.

 It should definitely be possible to establish a fairly clean common
 kernel API for this.  Doing so would have the extra advantage that if
 you had mixed ReiserFS and XFS partitions on the same machine, the
 VM's memory reservation would be able to cope cleanly with multiple
 users of reserved memory.

 Ok, so we agree that we need it, and the details we are still refining.

Yes.

--Stephen



Re: (reiserfs) Re: journal requirements for buffer.c

1999-10-14 Thread Stephen C. Tweedie

Hi,

On Thu, 14 Oct 1999 14:31:23 +0400, Hans Reiser [EMAIL PROTECTED] said:

 Ah, I see, the problem is that when you batch the commits they can be
 truly huge, and they all have to commit for any of them to commit, and
 none of them can be flushed until they all commit, is that it?

Exactly.  And the worst part of it is that while the transactions are
still growing and atomic filesystem operations are still running, you
can't even tell for sure exactly how big the transaction is going to
get eventually.

--Stephen



Re: [RFC] truncate() (generic stuff)

1999-10-12 Thread Stephen C. Tweedie

Hi,

On Mon, 11 Oct 1999 11:12:01 -0400 (EDT), Alexander Viro
[EMAIL PROTECTED] said:

   I began screwing around the truncate() stuff and the following is
 a status report/request for comments:
   a) call of ->truncate() method (and vmtruncate()) had been moved
 into the notify_change(). It is triggered if ATTR_SIZE is set. Modified
 places: do_truncate(), nfsd_truncate(), nfsd_setattr() and hpfs_unlink().
 Semantics had been changed (fixed, AFAICS) for nfsd_* - truncation happens
 with capabilities dropped. BTW, nfsd_setattr() didn't call vmtruncate().
 Fixed.

The page cache has very strict ordering requirements on the
inode->i_size update, the vmtruncate call and the actual file mapping
truncate operation.  I'd be nervous about letting the fs do too much of
that itself: what exactly is the rationale behind the change?

   coda,nfs,ncpfs,smbfs - I'm bringing the call of vmtruncate() into
 the end of ->notify_change() (if ATTR_SIZE is set, indeed). It preserves
 the current behaviour, but I'm not sure that it's OK. Those filesystems
 don't have ->truncate() and do the equivalent themselves (from
 ->notify_change()). Maybe vmtruncate() should be called earlier.

Given that we are already looking at adding back in the inode write lock
for truncate, that shouldn't be a problem.

--Stephen



Re: [patch] [possible race in ext2] Re: how to write get_block?

1999-10-12 Thread Stephen C. Tweedie

Hi,

On Sat, 9 Oct 1999 23:53:01 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

 What I said about bforget in my old email is still true. The _only_ reason
 for using bforget instead of brelse is to get buffer performances (that in
 2.3.x are not so interesting as in 2.2.x as in 2.3.x flushpage is just
 doing the interesting stuff with the real data).

 The current design bug in 2.3.20pre2 and previous has nothing to do with
 bforget.

Nope.  We spent a fair amount of effort on this with the page cache
changes.  The ext2 truncate code is really, really careful to provide
bforget() with the correct conditions to get rid of the buffer: it is a
closely designed interaction between delete and bforget.  Remember, we
have exactly the same potential problems when freeing dirty indirect
blocks as we have when freeing directory blocks.

There's a big comment near the top of fs/ext2/truncate.c about the way
in which this works.

 The right fix is to do a query on the hash every time you overlap a
 buffer on the page cache.

Ouch --- that pays the penalty for normal data blocks every time you
pull them into cache.  No way.

The correct way to extend the current rules cleanly is to make the
truncate code do a bforget() on data blocks as well as indirect blocks
if, and only if, the file is not a regular file.  That will deal with
the symlink case too, and will mean that we are using the same mechanism
for all of our dynamically-allocatable metadata.
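
As a sketch, the rule in the truncate path comes down to a single test
(surrounding details elided, helper name invented):

  /* Sketch of the proposed rule for data blocks freed by truncate
   * (indirect blocks already get bforget()).  For directories, symlinks
   * and other non-regular files the freed block is dynamically allocated
   * metadata: forget it, so a stale dirty copy can never hit disk after
   * the block is reallocated.  Regular-file data keeps the cheap path. */
  static void release_freed_block(struct inode *inode, struct buffer_head *bh)
  {
          if (S_ISREG(inode->i_mode))
                  brelse(bh);
          else
                  bforget(bh);
  }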

--Stephen



Re: [patch] [possible race in ext2] Re: how to write get_block?

1999-10-12 Thread Stephen C. Tweedie

Hi,

On 11 Oct 1999 17:58:54 -0500, [EMAIL PROTECTED] (Eric W. Biederman)
said:

 What about adding to the end of ext2_alloc_block:

 bh = get_hash_table(inode->i_dev, result, inode->i_sb->s_blocksize);
 /* something is playing with our fresh block, make them stop.  ;-) */
 if (bh) {
         if (buffer_dirty(bh)) {
                 mark_buffer_clean(bh);
                 wait_on_buffer(bh);
         }
         bforget(bh);
 }

Again, it's a lot of extra unnecessary lookups.  The advantages of
having a dirty buffer list include being fast, and also massively
speeding up the metadata update part of fsync.

 Ultimately we really want to have indirect blocks, and
 the directory in the page cache as it should result in
 more uniform code, and faster partial truncates (as well as faster
 syncs).

There is one major potential future problem with moving this to the page
cache.  At some point I want to be able to extend the large (64G) memory
support on Intel to include the page cache in high memory.  The buffer
cache would still live in low memory.  If we do that, then moving
filesystem metadata out of permanently-mapped buffer memory and into the
page cache is going to complicate directory and indirect operations
significantly.

--Stephen



Re: [patch] [possible race in ext2] Re: how to write get_block?

1999-10-12 Thread Stephen C. Tweedie

Hi,

On Tue, 12 Oct 1999 15:39:35 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

 On Tue, 12 Oct 1999, Stephen C. Tweedie wrote:
 changes.  The ext2 truncate code is really, really careful to provide

 I was _not_ talking about ext2 at all. I was talking about the bforget and
 brelse semantics. As bforget fallback to brelse you shouldn't expect
 bforget to really destroy the buffer. It may do that but only if it's
 possible.

 You are using a _trick_ to let bforget to do the thing you want. If all
 filesystem needs such thing it's ugly having to duplicate that current
 ext2 tricks all over the place.

Andrea, you are just trying to relax carefully designed buffer cache
semantics which are relied upon by the current filesystems.  Saying it
is a trick doesn't help matters much.

 If you use bforget to avoid fs corruption you are asking for troubles and
 IMHO you are going in the wrong direction.

Umm, your last proposal was to do a hash lookup on each new page cache
buffer mapping.  That is a significant performance cost, which IMHO is
not exactly the right direction either. :)

 You may change bforget to do the right thing cleanly, but it will be badly
 blocking. While you shouldn't block unless you are the one who wants to
 reuse the buffer that is currently under I/O or dirty. So if you'll change
 bforget then you'll block in the wrong place.

Unfortunately, the plain fact is that the current page cache relies on
data blocks never being in the buffer cache hash lists, and certainly
never being dirty in that cache.  _That_ is the invariant we need to
preserve.  truncate already does that for regular files.  Please, let's
try to keep that clean invariant rather than increase the cost of normal
page cache operations.

--Stephen



Re: [patch] [possible race in ext2] Re: how to write get_block?

1999-10-11 Thread Stephen C. Tweedie

Hi,

On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

 My point was that even being forced to do a lookup before creating
 each empty buffer, will be still faster than 2.2.x as in 2.3.x the hash
 will contain only metadata. Less elements means faster lookups.

The _fast_ quick fix is to maintain a per-inode list of dirty buffers
and to invalidate that list when we do a delete.  This works for
directories if we only support truncate back to zero --- it obviously
gets things wrong if we allow partial truncates of directories (but why
would anyone want to allow that?!)

This would have minimal performance implication and would also allow
fast fsync() of indirect block metadata for regular files.
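
A sketch of the idea with invented names; each inode would carry one
such list head, and locking and IO-completion waits are omitted:

  /* Sketch only: none of these names exist in the stock kernel. */
  struct dirty_bh {
          struct list_head     list;
          struct buffer_head  *bh;
  };

  /* fsync(): walk only this inode's dirty buffers instead of scanning
   * the global dirty list. */
  static void sync_inode_buffers(struct list_head *dirty)
  {
          struct list_head *p;

          for (p = dirty->next; p != dirty; p = p->next) {
                  struct dirty_bh *e = list_entry(p, struct dirty_bh, list);
                  ll_rw_block(WRITE, 1, &e->bh);
          }
  }

  /* delete / truncate back to zero: these buffers must never reach the
   * disk now, so forget them rather than letting bdflush write stale
   * data over a reallocated block. */
  static void drop_inode_buffers(struct list_head *dirty)
  {
          while (!list_empty(dirty)) {
                  struct dirty_bh *e =
                          list_entry(dirty->next, struct dirty_bh, list);
                  list_del(&e->list);
                  bforget(e->bh);
                  kfree(e);
          }
  }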

--Stephen



Re: RFC on raw IO implementation

1999-05-30 Thread Stephen C. Tweedie

Hi,

On Thu, 27 May 1999 22:18:50 -0700 (PDT), Linus Torvalds
[EMAIL PROTECTED] said:

 I care not one whit what the interface is on a /dev level

Fine, I can live with that!

 the only thing I care about is that the internal interfaces make sense
 (ie are purely based on kernel physical addresses, and have nothing at
 all to do with user virtual addresses).

 Having a simple translation layer to old-fashioned UNIX semantics makes
 sense, but doesn't mater from a kernel internals standpoint, so I don't
 find it all that interesting. 

Agreed, and the current code keeps that distinction clear.  The raw
device code generates kiobufs from user space but there is no trace of
the origin of those pages when we pass them to the IO layers: all
passing is done by containers of arbitrary physical pages.

 IF you think the translation layer matters to the internal implementation,
 then I can only say that something else is broken, and no, the patches
 wouldn't get accepted. 

That's fine --- it is purely the top-level API for backwards
compatibility raw character devices I was worried about, since that's
the only part of the code which may give rise to compatibility issues if
people start using the raw diffs against 2.2.

--Stephen



Re: We have a problem with O_SYNC

1999-05-30 Thread Stephen C. Tweedie

Hi,

On Thu, 27 May 1999 22:15:29 -0700 (PDT), Linus Torvalds [EMAIL PROTECTED] said:

 On Fri, 28 May 1999, Stephen C. Tweedie wrote:

 I have a patch I've been trying out to improve fsync performance by
 maintaining per-inode dirty buffer lists, and to implement fdatasync
 by tracking "significant" and "insignificant" (ie.  timestamp) dirty
 flags in the inode separately.

 Don't bother. This is one of the issues that is just going to go away
 when we do dirty blocks correctly (ie the patches that Ingo is working
 on).

The per-inode lists will go away, but the bulk of the patch is the VFS
extension necessary to distinguish between fsync() and fdatasync() (the
VFS methods currently lack any way of making that distinction in a
fsync call).  The (minor) inode changes required to support the
split-personality dirty bits will also be needed: I'll strip those out
from the buffer cache changes.
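
The split-personality bits would amount to something like this (flag
names invented here, purely as an illustration):

  /* Invented flag names: two dirty bits on the inode instead of one. */
  #define I_DIRTY_DATA   0x01  /* size, block pointers: fdatasync must write */
  #define I_DIRTY_TIMES  0x02  /* time stamps only: fdatasync may skip these */

  /* fsync() flushes the inode for either bit; fdatasync() only for the
   * "significant" one. */
  static int inode_needs_write(unsigned int dirty_flags, int datasync)
  {
          if (dirty_flags & I_DIRTY_DATA)
                  return 1;
          return datasync ? 0 : ((dirty_flags & I_DIRTY_TIMES) != 0);
  }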

 Fixing O_SYNC will ruin the performance of such applications. 

 I disagree - I don't think that O_SYNC should imply writing back access
 and mtimes. If the file size really changes, that we definitely should
 write the inode back, I don't think we can honestly say that anything else
 would make sense..

According to singleunix, we have no option: the entire point of having a
separate O_DSYNC is that the O_SYNC clearly specifies semantics which
are too expensive for most applications to use.

--Stephen



We have a problem with O_SYNC

1999-05-28 Thread Stephen C. Tweedie

Hi Linus,

I have a patch I've been trying out to improve fsync performance by
maintaining per-inode dirty buffer lists, and to implement fdatasync by
tracking "significant" and "insignificant" (ie.  timestamp) dirty flags
in the inode separately.  However, in doing this I found a serious
problem with O_SYNC.

Currently, O_SYNC does not flush the inode to disk after the write.  Not
even a write to an O_SYNC file descriptor which extends the file will
cause the inode to be updated.  The newly allocated indirect blocks are
written, but not the new filesize.

This presents a real difficulty because we have no O_DSYNC.  We do have
fsync and fdatasync already defined as syscalls, but we don't have that
distinction for O_SYNC.  Right now, an application (such as Oracle)
which writes in place to already-allocated data using O_SYNC actually
benefits from this: we end up not writing the inode timestamp updates to
disk, so O_SYNC behaves like O_DSYNC in terms of performance.  This is
good. 

Fixing O_SYNC will ruin the performance of such applications.  Not
fixing it just seems inexcusable since we now have the mechanism in
place to do both O_SYNC and O_DSYNC correctly.

We _can_ get around this if we are careful.  What I'd like to do is:

1) Add an O_DSYNC define with the existing bit pattern of O_SYNC
2) Add a new O_SYNC which is O_DSYNC or'ed with a new bit.
3) Advise the database vendors to use O_DSYNC when building their
   applications if that is defined, otherwise use O_SYNC.
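
In header terms the scheme would look roughly like this; the bit values
are placeholders, not the real ones:

  /* Sketch of the proposed flag layout; bit values are placeholders. */
  #define O_DSYNC  010000              /* today's O_SYNC bit: data-only sync */
  #define O_SYNC  (O_DSYNC | 020000)   /* O_DSYNC plus a new "inode too" bit */

  /* Step 3, application side: prefer O_DSYNC where the headers define
   * it.  Old binaries carry only the old bit and so keep getting O_DSYNC
   * behaviour on new kernels, which is what they get today anyway. */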

Applications using the existing ABI will continue to run correctly on
the new kernel, without performance penalty: they will just get
O_DSYNC.  That's what they get today anyway (except we don't even do
that correctly for extending writes).

Applications using the new O_SYNC or O_DSYNC will work correctly on new
kernels and will get the current, broken behaviour on old kernels.

If you want to look at the diff, it is at

ftp://ftp.uk.linux.org/pub/linux/sct/fs/misc/fsync-2.2.9-a.diff 

It implements fsync and fdatasync correctly, but O_SYNC just works like
O_DSYNC for now (so that people can test out the O_DSYNC performance
while we sort out the API and ABI issues).

--Stephen



Re: Odd code in iput() (since 2.1.60). What for?

1999-03-22 Thread Stephen C. Tweedie

Hi,

On Sat, 20 Mar 1999 15:46:18 -0500 (EST), Alexander Viro
[EMAIL PROTECTED] said:

   Folks, could somebody recall why the check for I_DIRTY had been
 added to iput()? AFAICS it does nothing. If the inode is hashed and clean
 it's already on inode_in_use, otherwise we are in *big* trouble (the only
 reason for that might be crazy ->delete_inode() doing
 insert_inode_hash()).

To maintain an LRU ordering for recently released inodes, I imagine.
However, right now there's nothing I can see in inode.c which actually
relies on that ordering: whenever we do a free_inodes(), we dump all
the inodes that we can.  In the future, having a sane LRU ordering on
the in-use list may be valuable.

--Stephen