"Curtis Faith" <[EMAIL PROTECTED]> writes:
> ... most file systems can't process fsync's
> simultaneous with other writes, so those writes block because the file
> system grabs its own internal locks.

Oh?  That would be a serious problem, but I've never heard that asserted
before.  Please provide some evidence.

On a filesystem that does have that kind of problem, can't you avoid it
just by using O_DSYNC on the WAL files?  Then there's no need to call
fsync() at all, except during checkpoints (which actually issue sync()
not fsync(), anyway).
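
To illustrate what I mean by O_DSYNC, here is a minimal sketch (not the actual backend code). The names wal_open_dsync() and wal_write() are hypothetical helpers made up for this example, and the fallback to O_SYNC assumes a platform that lacks O_DSYNC. With the file opened this way, each write() returns only once the data has reached stable storage, so no separate fsync() is issued on the commit path.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: fall back to O_SYNC where the platform lacks O_DSYNC. */
#ifndef O_DSYNC
#define O_DSYNC O_SYNC
#endif

/* Hypothetical helper: open a log file so that every write() is
 * synchronous (data is on stable storage when write() returns). */
int wal_open_dsync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
}

/* Hypothetical helper: write len bytes at the current file offset.
 * Returns 0 on success, -1 on error.  No fsync() call is needed
 * afterwards because the descriptor was opened with O_DSYNC. */
int wal_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, just retry */
        if (n <= 0)
            return -1;          /* real error, or no progress */
        p += n;
        len -= (size_t) n;
    }
    return 0;
}
```

A caller would open the segment once with wal_open_dsync() and then call wal_write() for each batch of WAL data; the cost of synchronous I/O is paid inside write() itself.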

> Whether by threads or multiple processes, there is the same contention on
> the file through multiple writers. The file system can decide to reorder
> writes before they start but not after. If a write comes after a
> fsync starts it will have to wait on that fsync.

AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
on safety grounds (we want to be sure we have a consistent WAL up to the
end of what we've written).  Even if we can allow some reordering when a
single transaction puts out a large volume of WAL data, I fail to see
where any large gain is going to come from.  We're going to be issuing
those writes sequentially and that ought to match the disk layout about
as well as can be hoped anyway.

> Likewise a given process's writes can NEVER be reordered if they are
> submitted synchronously, as is done in the calls to flush the log as
> well as the dirty pages in the buffer in the current code.

We do not fsync buffer pages; in fact a transaction commit doesn't write
buffer pages at all.  I think the above is just a misunderstanding of
what's really happening.  We have synchronous WAL writing, agreed, but
we want that AFAICS.  Data block writes are asynchronous (between
checkpoints, anyway).

There is one thing in the current WAL code that I don't like: if the WAL
buffers fill up then everybody who would like to make WAL entries is
forced to wait while some space is freed, which means a write, which is
synchronous if you are using O_DSYNC.  It would be nice to have a
background process whose only task is to issue write()s as soon as WAL
pages are filled, thus reducing the probability that foreground
processes have to wait for WAL writes (when they're not committing, that
is).  But this could be done portably with one more postmaster child
process; I see no real need to dabble in aio_write.
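
A rough sketch of that idea follows, under loudly stated assumptions: WalPage, WalShared, wal_shared_create(), wal_writer_pass() and wal_writer_spawn() are all hypothetical names invented here, the buffer is a fixed array rather than a ring, and the plain volatile flags stand in for the real locking that shared memory would need. It only shows the shape: a forked child that issues write()s for filled pages, in WAL order, so foreground processes rarely have to.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define WAL_PAGE_SIZE 8192
#define WAL_NPAGES    4

typedef struct {
    char data[WAL_PAGE_SIZE];
    volatile int full;      /* set by a backend once the page is filled */
    volatile int written;   /* set by the writer once write() returned */
} WalPage;

typedef struct {
    WalPage pages[WAL_NPAGES];
    volatile int stop;      /* set by the parent to shut the writer down */
} WalShared;

/* Shared, zero-filled memory visible to the parent and the forked child. */
WalShared *wal_shared_create(void)
{
    void *p = mmap(NULL, sizeof(WalShared), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (WalShared *) p;
}

static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t) n;
    }
    return 0;
}

/* One pass: write filled pages strictly in order, stopping at the first
 * page that is not yet full.  Returns pages written, or -1 on error. */
int wal_writer_pass(int fd, WalShared *sh)
{
    int count = 0;

    assert(sh != NULL);
    for (int i = 0; i < WAL_NPAGES; i++) {
        WalPage *pg = &sh->pages[i];

        if (pg->written)
            continue;
        if (!pg->full)
            break;              /* never write past an unfilled page */
        if (write_all(fd, pg->data, WAL_PAGE_SIZE) != 0)
            return -1;
        pg->written = 1;
        count++;
    }
    return count;
}

/* Fork the background writer; returns the child's pid (or -1). */
pid_t wal_writer_spawn(int fd, WalShared *sh)
{
    pid_t pid = fork();

    if (pid == 0) {
        while (!sh->stop) {
            int n = wal_writer_pass(fd, sh);
            if (n < 0)
                _exit(1);
            if (n == 0)
                usleep(1000);   /* nothing to do, nap briefly */
        }
        _exit(wal_writer_pass(fd, sh) < 0 ? 1 : 0);  /* final drain */
    }
    return pid;
}
```

The parent would call wal_shared_create() before forking, start the child with wal_writer_spawn(), and later set stop and waitpid() on it. A committing backend would still have to flush its own commit record itself; the child only reduces how often a full buffer forces someone to wait.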

                        regards, tom lane
