Hi Linus,

I have a patch I've been trying out to improve fsync performance by
maintaining per-inode dirty buffer lists, and to implement fdatasync by
tracking "significant" and "insignificant" (i.e. timestamp) dirty flags
in the inode separately.  However, in doing this I found a serious
problem with O_SYNC.
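
As a rough sketch of the "significant" versus "insignificant" split, the
decision each call has to make might look like the following.  The flag
and function names here are hypothetical and are not taken from the
patch; they only illustrate the idea that fdatasync can skip the inode
write when nothing but timestamps has changed.

```c
#include <assert.h>

/* Hypothetical flag names for this sketch; the patch may use different ones. */
#define SK_DIRTY_SIGNIFICANT   0x1u  /* e.g. file size or block pointers changed */
#define SK_DIRTY_INSIGNIFICANT 0x2u  /* only timestamps changed */

struct sk_inode {
    unsigned dirty;  /* bitwise OR of the SK_DIRTY_* flags */
};

/* Record that some part of the in-memory inode has been modified. */
void sk_mark_dirty(struct sk_inode *inode, unsigned flags)
{
    /* Reject anything that is not one of the two flags defined above. */
    assert((flags & ~(SK_DIRTY_SIGNIFICANT | SK_DIRTY_INSIGNIFICANT)) == 0);
    inode->dirty |= flags;
}

/* fsync: the inode must be written if anything at all is dirty. */
int sk_fsync_writes_inode(const struct sk_inode *inode)
{
    return inode->dirty != 0;
}

/* fdatasync: a timestamp-only change does not force an inode write. */
int sk_fdatasync_writes_inode(const struct sk_inode *inode)
{
    return (inode->dirty & SK_DIRTY_SIGNIFICANT) != 0;
}
```

With only SK_DIRTY_INSIGNIFICANT set, the fsync check asks for an inode
write and the fdatasync check does not.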

Currently, O_SYNC does not flush the inode to disk after the write.  Not
even a write to an O_SYNC file descriptor which extends the file will
cause the inode to be written.  The newly allocated indirect blocks are
written, but not the new file size.

This presents a real difficulty because we have no O_DSYNC.  We do have
fsync and fdatasync already defined as syscalls, but we don't have that
distinction for O_SYNC.  Right now, an application (such as Oracle)
which writes in place to already-allocated data using O_SYNC actually
benefits from this: we end up not writing the inode timestamp updates to
disk, so O_SYNC behaves like O_DSYNC in terms of performance.  This is
good. 

Fixing O_SYNC will ruin the performance of such applications.  Not
fixing it just seems inexcusable since we now have the mechanism in
place to do both O_SYNC and O_DSYNC correctly.

We _can_ get around this if we are careful.  What I'd like to do is:

1) Add an O_DSYNC define with the existing bit pattern of O_SYNC.
2) Add a new O_SYNC which is O_DSYNC ORed with a new bit.
3) Advise the database vendors to use O_DSYNC when building their
   applications if it is defined, and O_SYNC otherwise.
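
On the application side, the build-time choice in (3) could be as simple
as the sketch below.  DB_SYNC_FLAG and db_open_datafile() are
hypothetical names made up for this illustration; only open(), O_WRONLY,
O_SYNC and O_DSYNC are real.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Prefer O_DSYNC where the headers define it; otherwise fall back to O_SYNC. */
#ifdef O_DSYNC
#define DB_SYNC_FLAG O_DSYNC
#else
#define DB_SYNC_FLAG O_SYNC
#endif

/*
 * Hypothetical helper: open an existing, preallocated data file for
 * synchronous in-place writes.  Returns a file descriptor, or -1 with
 * errno set, exactly as open() does.
 */
int db_open_datafile(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | DB_SYNC_FLAG);
}
```

A binary built this way against old headers simply passes the existing
O_SYNC bit, which is the same bit pattern proposed for O_DSYNC.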

Applications using the existing ABI will continue to run correctly on
the new kernel, without performance penalty: they will just get
O_DSYNC.  That's what they get today anyway (except we don't even do
that correctly for extending writes).

Applications using the new O_SYNC or O_DSYNC will work correctly on new
kernels and will get the current, broken behaviour on old kernels.
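
The compatibility argument can be checked with a small sketch of the
flag encoding.  The numeric values below are placeholders chosen for
illustration, not proposed values, and the "old kernel" function assumes
that an old kernel tests only the bit it knows and ignores any other.

```c
#include <assert.h>

/* Placeholder values for illustration only. */
#define SK_O_DSYNC    010000u    /* stands in for the existing O_SYNC bit pattern */
#define SK_O_SYNC_BIT 04000000u  /* hypothetical new bit */
#define SK_O_SYNC     (SK_O_DSYNC | SK_O_SYNC_BIT)

/* New kernel: data must be synced whenever the old bit is set. */
int sk_new_kernel_data_sync(unsigned flags)
{
    return (flags & SK_O_DSYNC) != 0;
}

/* New kernel: the inode must also be synced only when both bits are set. */
int sk_new_kernel_full_sync(unsigned flags)
{
    return (flags & SK_O_SYNC) == SK_O_SYNC;
}

/* Old kernel (assumed): knows only the old bit and ignores the new one. */
int sk_old_kernel_sync(unsigned flags)
{
    return (flags & SK_O_DSYNC) != 0;
}

/* Sanity check: the new O_SYNC is a strict superset of O_DSYNC. */
void sk_check_encoding(void)
{
    assert((SK_O_SYNC & SK_O_DSYNC) == SK_O_DSYNC);
    assert(SK_O_SYNC != SK_O_DSYNC);
}
```

An old binary passes only the old bit, so a new kernel gives it O_DSYNC
semantics.  A new binary passing either flag to an old kernel sets the
old bit, so it gets today's behaviour.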

If you want to look at the diff, it is at

        ftp://ftp.uk.linux.org/pub/linux/sct/fs/misc/fsync-2.2.9-a.diff 

It implements fsync and fdatasync correctly, but O_SYNC just works like
O_DSYNC for now (so that people can test out the O_DSYNC performance
while we sort out the API and ABI issues).

--Stephen
