Bruce Momjian wrote:

Shridhar Daithankar wrote:

On Friday 14 November 2003 22:10, Jan Wieck wrote:

Shridhar Daithankar wrote:

On Friday 14 November 2003 03:05, Jan Wieck wrote:

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm for how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.

Would having fsync for regular data files and sync for WAL segments be a comfortable compromise? Or is this going to use fsync for all of them?

IMO, with fsync, we tell the kernel that it can write this buffer. It may or
may not write it immediately, unless it is a hard sync.

I think it's more the other way around. On some systems sync() might return before all buffers are flushed to disk, while fsync() does not.
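As a rough illustration of the fsync() side of this (a sketch in Python rather than PostgreSQL's own C, with `write_durably` being a hypothetical helper name, not anything in the patch): the call on a specific descriptor does not return until the kernel reports that file's data as flushed, which is the guarantee sync() does not give on every system.

```python
import os


def write_durably(path: str, data: bytes) -> int:
    """Write data to path and block until the kernel reports it flushed.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # os.write may write fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
        # fsync() blocks until this file's dirty buffers are flushed.
        # sync(), on some systems, only schedules the flush and returns.
        os.fsync(fd)
        return offset
    finally:
        os.close(fd)
```

The cost is that the caller waits for the disk, which is why the thread goes on to discuss who should issue the call and how often.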

Oops.. that's bad.


Yes, one idea I had was to do an fsync on a new file _after_ issuing
sync, hoping that this will complete after all the sync buffers are
done.


Since postgresql can afford lazy writes for data files, I think this
could work.

The whole point of a checkpoint is to know for certain that a specific change is in the datafile, so that it is safe to throw away older WAL segments.

I just made another posting on patches for a thread crossing win32-devel.


Essentially I said

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC. (Not sure if the current code does it. The hackery in xlog.c is not exactly trivial.)


We write WAL, then fsync, so if we write multiple blocks, we can write
them and fsync once, rather than O_SYNC every write.
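A minimal sketch of that write-then-fsync-once pattern, in Python for brevity. The function name, the 8192-byte block size, and the append-only file layout are assumptions made for the example, not details of xlog.c.

```python
import os

BLOCK_SIZE = 8192  # assumed block size, for illustration only


def append_blocks_then_fsync(path, blocks):
    """Append every block with ordinary writes, then fsync a single time.

    Returns the number of fsync calls issued, which is 1 regardless of
    how many blocks were written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    fsync_calls = 0
    try:
        for block in blocks:
            offset = 0
            while offset < len(block):
                # Plain buffered-by-the-kernel write; no disk wait here.
                offset += os.write(fd, block[offset:])
        # One wait for the disk covers all blocks written above.
        os.fsync(fd)
        fsync_calls += 1
    finally:
        os.close(fd)
    return fsync_calls
```

With O_SYNC on the descriptor, each of those writes would instead wait for the disk individually.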


2. Open data files normally and fsync them only in background writer process.

Now the BGWriter process will flush everything at checkpoint time. It does not need to flush WAL because of O_SYNC (ideally, but an additional fsync won't hurt). So it just flushes all the file descriptors touched since the last checkpoint, which should not be much of a load because it is flushing those files intermittently anyway.
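The bookkeeping this implies could look roughly like the following single-process sketch. The `TouchedFiles` class and its method names are hypothetical, and it tracks paths rather than open descriptors to keep the example short; how the real background writer would learn which files were touched is still open in this discussion.

```python
import os


class TouchedFiles:
    """Hypothetical record of files written since the last checkpoint."""

    def __init__(self):
        self._paths = set()

    def note_write(self, path):
        # A set means a file written many times is flushed only once.
        self._paths.add(path)

    def checkpoint(self):
        """fsync every remembered file and return the flushed paths."""
        # Swap in a fresh set first so writes noted later belong to
        # the next checkpoint.
        pending, self._paths = self._paths, set()
        flushed = []
        for path in sorted(pending):
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            flushed.append(path)
        return flushed
```

Only after `checkpoint()` returns would it be safe to treat older WAL segments as disposable.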

It could also work nicely if only the background writer fsyncs the data files. Backends can either wait or proceed to other business while the disk is being flushed. Backends need to wait for a certain time while committing, and that should be a rather small delay whether the sync to disk happens in the current process or in the background process.

In case of a commit, BGWriter could get away with flushing only the files touched in the transaction plus WAL, as opposed to all files touched since the last checkpoint plus WAL in case of a checkpoint. I don't know how difficult that would be.


What is different in current BGwriter implementation? Use of sync()?


Well, basically we are still discussing how to do this.  Right now the
background writer patch uses sync(), but the final version will use fsync
or O_SYNC, or maybe nothing.

The open question is whether a background process can keep the dirty
buffers cleaned fast enough to keep up with the maximum number of
backends.  We might need to use multiple processes or threads to do
this.   We certainly will have a background writer in 7.5 --- the big
question is whether _all_ writes will go through it.   It certainly would
be nice if it could, and Tom thinks it can, so we are still exploring
this.

Given that fsync is blocking, the background writer has to scale up in terms of processes/threads and load w.r.t. disk flushing.


I would vote for threads for the simple reason that, in BGWriter, threads are needed only to flush the files: get the fd, fsync it, and get the next one. No need to make the entire process thread safe.
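To make the shape of that idea concrete, here is a small sketch using a Python thread pool. `fsync_path`, `flush_with_threads`, and the default of four workers are all assumptions for illustration; the threads do nothing except open, fsync, and close.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def fsync_path(path):
    """Open one file, fsync it, close it, and return its path."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return path


def flush_with_threads(paths, workers=4):
    """Flush the given files using a small pool of worker threads.

    Each worker only takes the next file and blocks in fsync, so no
    other shared state needs to be made thread safe in this sketch.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order in its results.
        return list(pool.map(fsync_path, paths))
```

The sketch uses a fixed worker count and does not try to detect when the disk is saturated.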

Furthermore, BGWriter has to detect the disk limit. If adding threads does not improve fsync speed, it should stop adding them and wait. There is nothing to do when the disk is saturated.

If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer;  that will be
slower.
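For contrast, a sketch of the O_SYNC variant, again in Python with a hypothetical function name. `os.O_SYNC` is only defined on Unix-like systems, so this is assumed to run there. Here the write call itself does not return until the data has reached the disk, so the caller's buffer stays tied up for that long.

```python
import os


def write_block_o_sync(path, block):
    """Append one block through a descriptor opened with O_SYNC.

    Hypothetical helper. No separate fsync is needed, but every
    write waits for the disk before returning.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        offset = 0
        while offset < len(block):
            # Each call blocks until the written bytes are on disk.
            offset += os.write(fd, block[offset:])
        return offset
    finally:
        os.close(fd)
```

With plain writes followed by a later fsync, the buffer could be handed back for reuse as soon as the write returned.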

Certainly. However, a file opened with O_SYNC would not require a separate fsync. I suggested it only for WAL. But for WAL block grouping, as suggested in another post, using fsync for all files might be a good idea.


Just a thought.

Shridhar

