Hi all,
The following patch fully implements O_SYNC, fsync and fdatasync,
at least for ext2. The infrastructure it includes should make it
trivial for any other filesystem to do likewise.
The basic changes are:
  * Include a per-inode list of dirty buffers.
  * Pass a "datasync" parameter down to the filesystems when fsync
    or fdatasync is called, to distinguish between the two (when
    fdatasync is specified, we don't have to flush the inode to disk
    if only timestamps have changed).
  * Split I_DIRTY into two bits, one (I_DIRTY_SYNC) which is set
    for all dirty inodes, and the other (I_DIRTY_DATASYNC) which
    is set only if fdatasync needs to flush the inode (i.e. it is
    set for everything except timestamp updates). This means:
      * The old (flags & I_DIRTY) construct still returns
        true if the inode is in any way dirty; and
      * (flags |= I_DIRTY) sets both bits, as expected.
  * fs/ext2 and __block_commit_write are modified to record
    all newly dirtied buffers (both data and metadata) on the
    inode's dirty buffer list.
  * generic_file_write now honours the O_SYNC flag and calls
    generic_osync_inode(), which flushes the inode's dirty buffer
    list and calls the inode's fsync method.
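
To make the I_DIRTY split concrete, here is a small userspace sketch of how the two bits interact. It is not taken from the patch: the bit values, the struct and the sketch_* helpers are assumptions made up for illustration; only the I_DIRTY* names and their meaning come from the description above.

```c
#include <assert.h>   /* only needed if you add checks like those in the note below */

/* Assumed bit values, for illustration only; the patch defines the real ones. */
#define I_DIRTY_SYNC      1  /* set whenever the inode is dirty in any way */
#define I_DIRTY_DATASYNC  2  /* set only when fdatasync must also flush the inode */
#define I_DIRTY           (I_DIRTY_SYNC | I_DIRTY_DATASYNC)

/* Hypothetical stand-in for the kernel's struct inode. */
struct sketch_inode {
    unsigned int flags;
};

/* Timestamp-only update: fsync cares about it, fdatasync does not. */
static void sketch_mark_time_dirty(struct sketch_inode *inode)
{
    inode->flags |= I_DIRTY_SYNC;
}

/* Any other inode change: sets both bits, as (flags |= I_DIRTY) always did. */
static void sketch_mark_dirty(struct sketch_inode *inode)
{
    inode->flags |= I_DIRTY;
}

/* Returns 1 if the inode itself must be written for this sync call, else 0. */
static int sketch_inode_needs_write(const struct sketch_inode *inode, int datasync)
{
    if (datasync)
        return (inode->flags & I_DIRTY_DATASYNC) != 0;
    return (inode->flags & I_DIRTY) != 0;
}
```

After sketch_mark_time_dirty(), sketch_inode_needs_write() returns 1 for datasync == 0 and 0 for datasync == 1; after sketch_mark_dirty() it returns 1 in both cases.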
Note: currently, the O_SYNC code in generic_file_write calls
generic_osync_inode with datasync==1, which means that O_SYNC is
treated as what the SUS spec calls O_DSYNC. In other words,
O_SYNC is not guaranteed to flush timestamp updates to disk (but
fsync is). This is important: we do not currently have an O_DSYNC
flag (although that would now be trivial to implement), so existing
apps are forced to use O_SYNC instead. Apps such as Oracle rely on
O_SYNC for write ordering, but due to a 2.2 bug, existing kernels
don't do the timestamp update and hence we achieve decent
performance even without O_DSYNC. We cannot suddenly cause all of
those applications to experience a massive performance drop.
One way round this would be to split O_SYNC into O_DSYNC and
O_TRUESYNC, and in glibc to redefine O_SYNC to be (O_DSYNC |
O_TRUESYNC). If we keep the new O_DSYNC to have the same value
as the old O_SYNC, then:
* Old applications which specified O_SYNC will continue
to get their expected (O_DSYNC) behaviour
* New applications can specify O_SYNC or O_DSYNC and get
the selected behaviour on new kernels
* New applications specifying either O_SYNC or O_DSYNC will
still get O_SYNC on old kernels.
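
The three cases above can be sketched in userspace as follows. All numeric values and every SKETCH_/sketch_ name are hypothetical; the only thing taken from the proposal is that the new O_DSYNC keeps the old O_SYNC value and that the new O_SYNC is (O_DSYNC | O_TRUESYNC). The old-kernel helper also assumes that an old kernel simply ignores flag bits it does not know about.

```c
#include <assert.h>   /* only needed if you add checks like those in the note below */

/* Hypothetical flag values, chosen only for illustration. */
#define SKETCH_OLD_O_SYNC  010000             /* value existing binaries pass */
#define SKETCH_O_DSYNC     SKETCH_OLD_O_SYNC  /* new O_DSYNC reuses the old value */
#define SKETCH_O_TRUESYNC  04000000           /* assumed spare bit */
#define SKETCH_O_SYNC      (SKETCH_O_DSYNC | SKETCH_O_TRUESYNC)

enum sketch_sync { SKETCH_NONE, SKETCH_DATASYNC, SKETCH_FULLSYNC };

/* How a new kernel could decode the open flags. */
static enum sketch_sync sketch_new_kernel(int flags)
{
    if ((flags & SKETCH_O_SYNC) == SKETCH_O_SYNC)
        return SKETCH_FULLSYNC;   /* both bits set: flush timestamps too */
    if (flags & SKETCH_O_DSYNC)
        return SKETCH_DATASYNC;   /* old O_SYNC or new O_DSYNC */
    return SKETCH_NONE;
}

/* An old kernel knows only the old bit (assumption: unknown bits are ignored). */
static int sketch_old_kernel_is_sync(int flags)
{
    return (flags & SKETCH_OLD_O_SYNC) != 0;
}
```

With these definitions, sketch_new_kernel(SKETCH_OLD_O_SYNC) yields SKETCH_DATASYNC, sketch_new_kernel(SKETCH_O_SYNC) yields SKETCH_FULLSYNC, and sketch_old_kernel_is_sync() is true for both of the new flag values.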
In performance testing, "dd" with 64k blocks and writing into an
existing, preallocated file, gets close to theoretical disk bandwidth
(about 13MB/sec on a Cheetah), when using O_SYNC or when doing an
fdatasync between each write. Doing fsync instead gives only about
3MB/sec and results in a lot of audible disk seeking, as expected.
If I don't preallocate the file, then even fdatasync is slow, as it
now has to sync the changed i_size information after every write (and
it gets slower as the file grows and the distance between the inode
and the data being written increases).
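
For anyone wanting to reproduce the fdatasync-versus-fsync comparison from userspace, the per-write pattern looks roughly like the sketch below. This is not the test program used for the numbers above (that was plain "dd"); the sketch_* helpers are hypothetical and only use the standard open/write/fsync/fdatasync calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fdatasync() under strict -std= modes */
#endif

#include <assert.h>   /* only needed if you add checks around the calls */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open the file under test, creating it if needed. */
static int sketch_open(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/*
 * Hypothetical helper: write the whole buffer, then force it to disk.
 * datasync != 0 uses fdatasync(), otherwise fsync().
 * Returns 0 on success, -1 on error with errno set.
 */
static int sketch_write_synced(int fd, const char *buf, size_t len, int datasync)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return -1;
        }
        buf += n;                  /* short write: continue with the rest */
        len -= (size_t)n;
    }
    return datasync ? fdatasync(fd) : fsync(fd);
}
```

Calling sketch_write_synced() in a loop with 64k buffers and datasync == 1 on a preallocated file corresponds to the fast case described above; datasync == 0 corresponds to the fsync case.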
--Stephen
--- linux-2.4.0-test1-ac11.osync/fs/block_dev.c.~1~ Fri Jun 9 18:08:09 2000
+++ linux-2.4.0-test1-ac11.osync/fs/block_dev.c Fri Jun 9 18:08:18 2000
@@ -313,7 +313,7 @@
* since the vma has no handle.
*/
-static int block_fsync(struct file *filp, struct dentry *dentry)
+static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
return fsync_dev(dentry->d_inode->i_rdev);
}
--- linux-2.4.0-test1-ac11.osync/fs/buffer.c.~1~ Fri Jun 9 18:08:09 2000
+++ linux-2.4.0-test1-ac11.osync/fs/buffer.c Fri Jun 9 18:08:18 2000
@@ -68,6 +68,8 @@
* lru_list_lock > hash_table_lock > free_list_lock > unused_list_lock
*/
+#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_inode_buffers)
+
/*
* Hash table gook..
*/
@@ -323,7 +325,7 @@
* filp may be NULL if called via the msync of a vma.
*/
-int file_fsync(struct file *filp, struct dentry *dentry)
+int file_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
struct inode * inode = dentry->d_inode;
struct super_block * sb;
@@ -332,7 +334,7 @@
lock_kernel();
/* sync the inode to buffers */
- write_inode_now(inode);
+ write_inode_now(inode, 0);
/* sync the superblock to buffers */
sb = inode->i_sb;
@@ -373,7 +375,7 @@
/* We need to protect against concurrent writers.. */
down(&inode->i_sem);
- err = file->f_op->fsync(file, dentry);
+ err = file->f_op->fsync(file, dentry, 0);
up(&inode->i_sem);
out_putf:
@@ -406,9 +408,8 @@
if (!file->f_op || !file->f_op->fsync)
goto out_putf;
- /* this needs further work, at the moment it is identical to fsync() */
down(&inode->i_sem);
- err = f