Re: O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Stephen C. Tweedie

Hi,

On Fri, Jun 09, 2000 at 02:51:18PM -0700, Ulrich Drepper wrote:
> 
> Have you thought about O_RSYNC and whether it is possible/useful to
> support it separately?

It would be possible and useful, but it's entirely separate from the
write path and probably doesn't make sense until we've got O_DIRECT
working (O_RSYNC is closely related to that).

Cheers,
 Stephen



Re: O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Ulrich Drepper

Sorry for the separate mail:

"Stephen C. Tweedie" <[EMAIL PROTECTED]> writes:

> If I don't preallocate the file, then even fdatasync is slow, [...]

This might be a good argument to implement posix_fallocate() in the
kernel.

-- 
Ulrich Drepper    drepper at gnu.org, drepper at redhat.com
Red Hat           1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA



Re: O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Stephen C. Tweedie

Hi,

On Fri, Jun 09, 2000 at 02:53:19PM -0700, Ulrich Drepper wrote:
> 
> > If I don't preallocate the file, then even fdatasync is slow, [...]
> 
> This might be a good argument to implement posix_fallocate() in the
> kernel.

No.  If we do posix_fallocate(), then there are only two choices:
we either pre-zero the file contents (in which case we may as well
do it from user space), or we record in the inode that the file
isn't pre-zeroed and optimise based on that.
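
For the first choice, the user-space emulation is only a few lines; a
minimal sketch (prealloc() is a made-up helper name, error handling
kept minimal):

    #include <sys/types.h>
    #include <unistd.h>

    /* Pre-zero the first "len" bytes of fd and push them to disk. */
    static int prealloc(int fd, off_t len)
    {
            static const char zeroes[65536];
            off_t done = 0;

            while (done < len) {
                    size_t chunk = (len - done) < (off_t) sizeof(zeroes)
                            ? (size_t) (len - done) : sizeof(zeroes);
                    ssize_t n = write(fd, zeroes, chunk);
                    if (n <= 0)
                            return -1;
                    done += n;
            }
            return fsync(fd);
    }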

If we go for the second option, then an O_DSYNC write to the
already-allocated file has to record in the inode that it is pushing
the not-yet-zeroed fencepost forward through the file.  We end up
having to seek back to the inode for each write anyway, so we lose
any possible benefit during the writes.

Once you have a database file written and preallocated, this is all
academic, since all further writes will be in place and so will be
fast with the O_DSYNC/fdatasync support.

Cheers,
 Stephen



Re: O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Ulrich Drepper

"Stephen C. Tweedie" <[EMAIL PROTECTED]> writes:

>   * Old applications which specified O_SYNC will continue
> to get their expected (O_DSYNC) behaviour
> 
>   * New applications can specify O_SYNC or O_DSYNC and get
> the selected behaviour on new kernels
> 
>   * New applications calling either O_SYNC or O_DSYNC will
> still get O_SYNC on old kernels.

Have you thought about O_RSYNC and whether it is possible/useful to
support it separately?

-- 
Ulrich Drepper    drepper at gnu.org, drepper at redhat.com
Red Hat           1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA



O_SYNC patches for 2.4.0-test1-ac11

2000-06-09 Thread Stephen C. Tweedie

Hi all,

The following patch fully implements O_SYNC, fsync and fdatasync,
at least for ext2.  The infrastructure it includes should make it
trivial for any other filesystem to do likewise.

The basic changes are:

Include a per-inode list of dirty buffers

Pass a "datasync" parameter down to the filesystems when fsync
or fdatasync are called, to distinguish between the two (when
fdatasync is specified, we don't have to flush the inode to disk
if only timestamps have changed)

Split I_DIRTY into two bits: one (I_DIRTY_SYNC) which is set
for all dirty inodes, and the other (I_DIRTY_DATASYNC) which
is set only if fdatasync needs to flush the inode (i.e. it is
set for everything except timestamp-only updates).  This means
(see the sketch after this list):

The old (flags & I_DIRTY) construct still returns 
true if the inode is in any way dirty; and

(flags |= I_DIRTY) sets both bits, as expected.

fs/ext2 and __block_commit_write are modified to record all
newly dirtied buffers (both data and metadata) on the inode's
dirty buffer list

generic_file_write now honours the O_SYNC flag and calls
generic_osync_inode(), which flushes the inode dirty buffer
list and calls the inode's fsync method.
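
To make the bit semantics concrete, here is a minimal sketch of the
I_DIRTY split (the bit values and field names here are illustrative,
not necessarily those in the patch):

    #define I_DIRTY_SYNC     1  /* set whenever the inode is dirty at all */
    #define I_DIRTY_DATASYNC 2  /* set unless only timestamps changed */
    #define I_DIRTY          (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

            /* Timestamp-only update: fsync must flush the inode,
             * fdatasync need not. */
            inode->i_state |= I_DIRTY_SYNC;

            /* Anything fdatasync cares about (i_size, block pointers,
             * ...) sets both bits, so (flags |= I_DIRTY) works as
             * expected. */
            inode->i_state |= I_DIRTY;

            /* And the old test still catches any kind of dirtiness. */
            if (inode->i_state & I_DIRTY)
                    write_inode_now(inode, 0);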

Note: currently, the O_SYNC code in generic_file_write calls 
generic_osync_inode with datasync==1, which means that O_SYNC is
interpreted as O_DSYNC according to the SUS spec.  In other words,
O_SYNC is not guaranteed to flush timestamp updates to disk (but
fsync is).  This is important: we do not currently have an O_DSYNC
flag (although that would now be trivial to implement), so existing
apps are forced to use O_SYNC instead.  Apps such as Oracle rely on
O_SYNC for write ordering, but due to a 2.2 bug, existing kernels
don't flush the timestamp updates, and hence those apps achieve
decent performance even without O_DSYNC.  We cannot suddenly cause
all of those applications to take a massive performance drop.

One way round this would be to split O_SYNC into O_DSYNC and
O_TRUESYNC, and in glibc to redefine O_SYNC to be (O_DSYNC |
O_TRUESYNC).  If we keep the new O_DSYNC at the same value as
the old O_SYNC (see the header sketch after this list), then:

* Old applications which specified O_SYNC will continue
  to get their expected (O_DSYNC) behaviour

* New applications can specify O_SYNC or O_DSYNC and get
  the selected behaviour on new kernels

* New applications calling either O_SYNC or O_DSYNC will
  still get O_SYNC on old kernels.
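
In header terms the scheme might look like this (the bit values here
are purely illustrative):

    #define O_DSYNC     010000                  /* the historical O_SYNC bit */
    #define O_TRUESYNC  020000                  /* hypothetical new bit */
    #define O_SYNC      (O_DSYNC | O_TRUESYNC)  /* glibc's new definition */

An old kernel tests only the historical bit, so it treats both of the
new flags as the old O_SYNC, which gives the third property above.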

In performance testing, "dd" with 64k blocks writing into an
existing, preallocated file gets close to the theoretical disk
bandwidth (about 13MB/sec on a Cheetah) when using O_SYNC or when
doing an fdatasync after each write.  Doing fsync instead gives only
about 3MB/sec and results in a lot of audible disk seeking, as
expected.
If I don't preallocate the file, then even fdatasync is slow, as it
now has to sync the changed i_size information after every write (and
it gets slower as the file grows and the distance between the inode 
and the data being written increases).
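
For reference, a minimal sketch of the kind of test described above
(file name and sizes are arbitrary, not the actual test program):

    /* 64MB of 64k synchronous writes into an existing, preallocated
     * file.  Illustrative only. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char *buf = malloc(65536);
            int i, fd = open("testfile", O_WRONLY | O_SYNC);

            if (fd < 0 || !buf)
                    return 1;
            memset(buf, 0, 65536);
            for (i = 0; i < 1024; i++) {
                    if (write(fd, buf, 65536) != 65536)
                            return 1;
                    /* Variant: open without O_SYNC and call
                     * fdatasync(fd) (or fsync(fd)) here instead. */
            }
            close(fd);
            return 0;
    }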

--Stephen


--- linux-2.4.0-test1-ac11.osync/fs/block_dev.c.~1~ Fri Jun  9 18:08:09 2000
+++ linux-2.4.0-test1-ac11.osync/fs/block_dev.c Fri Jun  9 18:08:18 2000
@@ -313,7 +313,7 @@
  * since the vma has no handle.
  */
  
-static int block_fsync(struct file *filp, struct dentry *dentry)
+static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
return fsync_dev(dentry->d_inode->i_rdev);
 }
--- linux-2.4.0-test1-ac11.osync/fs/buffer.c.~1~ Fri Jun  9 18:08:09 2000
+++ linux-2.4.0-test1-ac11.osync/fs/buffer.c Fri Jun  9 18:08:18 2000
@@ -68,6 +68,8 @@
  * lru_list_lock > hash_table_lock > free_list_lock > unused_list_lock
  */
 
+#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_inode_buffers)
+
 /*
  * Hash table gook..
  */
@@ -323,7 +325,7 @@
  * filp may be NULL if called via the msync of a vma.
  */
  
-int file_fsync(struct file *filp, struct dentry *dentry)
+int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
struct inode * inode = dentry->d_inode;
struct super_block * sb;
@@ -332,7 +334,7 @@
 
lock_kernel();
/* sync the inode to buffers */
-   write_inode_now(inode);
+   write_inode_now(inode, 0);
 
/* sync the superblock to buffers */
sb = inode->i_sb;
@@ -373,7 +375,7 @@
 
/* We need to protect against concurrent writers.. */
down(&inode->i_sem);
-   err = file->f_op->fsync(file, dentry);
+   err = file->f_op->fsync(file, dentry, 0);
up(&inode->i_sem);
 
 out_putf:
@@ -406,9 +408,8 @@
if (!file->f_op || !file->f_op->fsync)
goto out_putf;
 
-   /* this needs further work, at the moment it is identical to fsync() */
down(&inode->i_sem);
-   err = file->f_op->fsync(file, dentry);
+   err = file->f_op->fsync(file, dentry, 1);

cache fs

2000-06-09 Thread Andrew Park

Hi, folks

Someone from this list was working on a cache FS for Linux a while
ago (more than a year ago, I think...).  I was just wondering whether
that was ever completed and is in workable shape.  It would be nice
if I could get it going in the lab I work at.
Thank you


Andrew Park
CDFlab Systems Administrator    www.cdf.utoronto.ca
Team BlueShirt Developer        www.blueshirt.org
GnuPG Signature                 www.cdf.utoronto.ca/~apark/public_key.txt