Tom Lane wrote:
> Greg Stark <[EMAIL PROTECTED]> writes:
> > Tom Lane <[EMAIL PROTECTED]> writes:
> >> You want to find, open, and fsync() every file in the database cluster
> >> for every checkpoint?  Sounds like a non-starter to me.
>
> > Except a) this is outside any critical path, and b) only done every few
> > minutes and c) the fsync calls on files with no dirty buffers ought to be
> > cheap, at least as far as i/o.
>
> The directory search and opening of the files is in itself nontrivial
> overhead ... particularly on systems where open(2) isn't speedy, such
> as Solaris.  I also disbelieve your assumption that fsync'ing a file
> that doesn't need it will be free.  That depends entirely on what sort
> of indexes the OS keeps on its buffer cache.  There are Unixen where
> fsync requires a scan through the entire buffer cache because there is
> no data structure that permits finding associated buffers any more
> efficiently than that.  (IIRC, the HPUX system I'm typing this on is
> like that.)  On those sorts of systems, we'd be way better off to use
> O_SYNC or O_DSYNC on all our writes than to invoke multiple fsyncs.
> Check the archives --- this was all gone into in great detail when we
> were testing alternative methods for fsyncing the WAL files.
Not sure on this one --- let's look at our options:

	O_SYNC
	fsync
	sync

Now, O_SYNC is going to force every write to the disk.  If we have a
transaction that has to write lots of buffers (has to write them to
reuse the shared buffer), it will have to wait for every buffer to hit
disk before the write returns --- this seems terrible to me and gives
the drive no way to group adjacent writes.

Even on HPUX, which has poor fsync dirty-buffer detection, if the fsync
is outside the main processing loop (checkpoint process), isn't fsync
better than O_SYNC?

Now, if we are sure that writes will happen only in the checkpoint
process, O_SYNC would be OK, I guess, but will we ever be sure of that?
I can't imagine a checkpoint process keeping up with lots of active
backends, especially if the writes use O_SYNC.  The problem is that
instead of having the backends write everything to kernel buffers, we
are all of a sudden forcing all writes of dirty buffers to disk.
sync() starts to look very attractive compared to that option.

fsync() is better in that we can force it after a number of writes, and
can delay it, so we can write a buffer and reuse it, then later issue
the fsync.  That is a win, though it doesn't allow the drive to group
adjacent writes in different files.  sync() of course allows grouping
of all writes by the drive, but it writes all non-PostgreSQL dirty
buffers too.

Ideally, we would have an fsync() where we could pass it a list of our
files and it would sync all of them optimally.

From what I have heard so far, sync() still seems like the most
efficient method.  I know it only schedules the writes, but with a
sleep after it, it seems like maybe our best bet.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings