Hi Linus,
I have a patch I've been trying out to improve fsync performance by
maintaining per-inode dirty buffer lists, and to implement fdatasync by
tracking "significant" and "insignificant" (i.e. timestamp) dirty flags
in the inode separately. However, in doing this I found a serious
problem with O_SYNC.
Currently, O_SYNC does not flush the inode to disk after the write. Not
even a write to an O_SYNC file descriptor which extends the file will
cause the inode to be updated. The newly allocated indirect blocks are
written, but not the new file size.
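
To make the extending-write case concrete, here is a minimal user-space
sketch. The helper name append_osync() is hypothetical and is not part of
the patch. It appends to a file through an O_SYNC descriptor; with a
correct O_SYNC the new file size should be on disk by the time write()
returns, which is the part that does not happen today.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append len bytes to path through an O_SYNC
 * descriptor. Returns the number of bytes written, or -1 on error.
 */
static ssize_t append_osync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    /* This write extends the file, so both the data and the size change. */
    ssize_t n = write(fd, buf, len);

    if (close(fd) < 0)
        return -1;
    return n;
}
```

The in-memory size seen by a later lseek() or stat() is always right; the
problem described above is only about what has reached the disk if the
machine goes down straight after the write.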
This presents a real difficulty because we have no O_DSYNC. We do have
fsync and fdatasync already defined as syscalls, but we don't have that
distinction for O_SYNC. Right now, an application (such as Oracle)
which writes in place to already-allocated data using O_SYNC actually
benefits from this: we end up not writing the inode timestamp updates to
disk, so O_SYNC behaves like O_DSYNC in terms of performance. This is
good.
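
For comparison, this is roughly what the same distinction looks like when
an application makes it explicitly with the two syscalls we already have.
It is only a sketch: overwrite_and_flush() is a made-up name, and the
file is assumed to be small enough that a single write() covers it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite the start of path in place, then flush.
 *   data_only != 0: fdatasync(), data plus any metadata needed to read it
 *   data_only == 0: fsync(), data plus all inode metadata (timestamps too)
 * Returns 0 on success, -1 on error.
 */
static int overwrite_and_flush(const char *path, const void *buf,
                               size_t len, int data_only)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    int rc = -1;
    /* A freshly opened descriptor starts at offset 0: an in-place write. */
    if (write(fd, buf, len) == (ssize_t)len)
        rc = data_only ? fdatasync(fd) : fsync(fd);

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

An O_DSYNC open flag would give the data_only behaviour on every write
without the extra syscall.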
Fixing O_SYNC will ruin the performance of such applications. Not
fixing it just seems inexcusable since we now have the mechanism in
place to do both O_SYNC and O_DSYNC correctly.
We _can_ get around this if we are careful. What I'd like to do is:
1) Add an O_DSYNC define with the existing bit pattern of O_SYNC.
2) Add a new O_SYNC which is O_DSYNC ORed with a new bit.
3) Advise the database vendors to use O_DSYNC when building their
   applications if it is defined, and otherwise fall back to O_SYNC.
Applications using the existing ABI will continue to run correctly on
the new kernel, without performance penalty: they will just get
O_DSYNC. That's what they get today anyway (except we don't even do
that correctly for extending writes).
Applications using the new O_SYNC or O_DSYNC will work correctly on new
kernels and will get the current, broken behaviour on old kernels.
If you want to look at the diff, it is at
ftp://ftp.uk.linux.org/pub/linux/sct/fs/misc/fsync-2.2.9-a.diff
It implements fsync and fdatasync correctly, but O_SYNC just works like
O_DSYNC for now (so that people can test out the O_DSYNC performance
while we sort out the API and ABI issues).
--Stephen