Hello Andres,

>>> My performance testing showed that calling PerformFileFlush() only at
>>> segment boundaries and in CheckpointWriteDelay() can lead to rather
>>> spiky IO - not that surprising. The sync in CheckpointWriteDelay() is
>>> problematic because it is only triggered while on schedule, and not
>>> when behind.

>> When behind, the PerformFileFlush should be called on segment
>> boundaries.

> That means it's flushing up to a gigabyte of data at once. Far too
> much.

Hmmm. I do not get it. There would not be gigabytes; there would only be as much as was written since the last sleep, about 100 ms ago, which is not likely to be gigabytes?

> The implementation will pretty much always go behind schedule for some
> time. Since sync_file_range() doesn't flush in the foreground, I don't
> think it's important to do the flushing in concert with sleeping.

For me it is important to avoid accumulating too large flushes, and that is the point of the call before sleeping.
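To make the intent concrete, the loop I have in mind is shaped roughly like this (a simplified sketch with invented helper names, not the actual patch code):

    #include <unistd.h>
    #include <stdbool.h>

    /* stubs standing in for the real checkpointer machinery (invented names) */
    extern void write_some_buffers(void);  /* write a batch of sorted buffers */
    extern bool on_schedule(void);         /* checkpoint progress vs. target  */
    extern void perform_file_flush(void);  /* issue pending writeback hints   */

    static void
    checkpoint_write_loop(void)
    {
        for (;;)                           /* loop exit elided */
        {
            write_some_buffers();

            if (on_schedule())
            {
                /* flush what was written since the last nap, so that with
                 * ~100 ms naps each flush covers ~100 ms worth of writes */
                perform_file_flush();
                usleep(100 * 1000);
            }
            /* when behind schedule there is no sleep, and flushing happens
             * only at segment boundaries (elided here) */
        }
    }

While on schedule, each flush stays small by construction, and that is the property I want to keep.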

>>> My testing seems to show that just adding a limit of 32 buffers to
>>> FileAsynchronousFlush() leads to markedly better results.

>> Hmmm. 32 buffers means 256 KB, which is quite small.

> Why?

Because the point of sorting is to generate sequential writes so that the HDD has a long run of contiguous blocks to write without moving its head, and 32 buffers is rather small for that.

> The aim is to not overwhelm the request queue - which is where the
> coalescing is done. And usually that's rather small.

That is an argument. How small, though? It seems to be 128 by default, so I'd rather have 128? Also, it can be changed, so maybe it should really be a GUC?
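For the sake of discussion, the capping logic could look like the sketch below, with the limit as a plain variable standing in for the hypothetical GUC (names and structure invented for illustration; 8 kB pages assumed):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    #define BLCKSZ 8192

    static int flush_after = 32;       /* pages; 32 => 256 kB, could be a GUC */

    static int   pending_fd = -1;
    static off_t pending_start = 0;
    static int   pending_pages = 0;

    /* issue the accumulated range as one asynchronous writeback request */
    static void
    flush_pending(void)
    {
        if (pending_pages > 0)
            (void) sync_file_range(pending_fd, pending_start,
                                   (off_t) pending_pages * BLCKSZ,
                                   SYNC_FILE_RANGE_WRITE);
        pending_pages = 0;
    }

    /* called once per written buffer, in sorted (mostly sequential) order */
    static void
    remember_written_page(int fd, off_t offset)
    {
        if (pending_pages > 0 &&
            (fd != pending_fd ||
             offset != pending_start + (off_t) pending_pages * BLCKSZ))
            flush_pending();           /* not contiguous: flush what we have */

        if (pending_pages == 0)
        {
            pending_fd = fd;
            pending_start = offset;
        }

        if (++pending_pages >= flush_after)
            flush_pending();           /* cap reached: do not accumulate more */
    }

With flush_after settable, comparing 32 vs 128 against different request queue sizes would at least be easy.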

> If you flush much more, sync_file_range starts to do work in the
> foreground.

Argh, too bad. I would have hoped that the OS would just deal with it in an asynchronous way; this is not an "fsync" call, just flush advice.
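For reference, the call under discussion is the Linux-specific one below; with SYNC_FILE_RANGE_WRITE alone it is only supposed to initiate writeback, which is why the foreground work is a surprise (a standalone sketch, not patch code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        if (argc != 4)
        {
            fprintf(stderr, "usage: %s file offset nbytes\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_WRONLY);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* ask the kernel to start writing back the given range, without
         * waiting for completion - a hint, not an fsync */
        if (sync_file_range(fd, atoll(argv[2]), atoll(argv[3]),
                            SYNC_FILE_RANGE_WRITE) != 0)
            perror("sync_file_range");

        return 0;
    }

The blocking presumably only shows up once the device request queue is already full.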

>>> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
>>> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
>>> might even be possible to later approximate that on Windows using
>>> FlushViewOfFile().

>> I'm not sure that mmap/msync can be used for this purpose, because
>> there seems to be no real control over where the file is mmapped.

> I'm not following? Why does it matter where a file is mapped?

Because it should be in shared buffers, where pg needs it? You probably do not want to mmap all pg data files in user space for a large database. Currently the OS keeps the data in memory if it has enough space, but if you switched to mmap, this cache management would become pg's responsibility, if I understand mmap and your intentions correctly.

> I have had a friend (Christian Kruse, thanks!) confirm that at least
> on OS X, msync(MS_ASYNC) triggers writeback. A FreeBSD dev confirmed
> that this should be the case on FreeBSD too.

Good. My concern is how mmap could be used, though, not the flushing part.
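For the record, I take it the flushing part would be a transient mapping of just the range to write back, something like this sketch (my guess at the mechanism; error handling reduced to the minimum):

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* hint the kernel to start writing back [offset, offset + len) of fd */
    static int
    flush_range_hint(int fd, off_t offset, size_t len)
    {
        /* msync() operates on mappings, and mmap() needs a page-aligned
         * file offset, so map a page-aligned range covering the target */
        long   pagesz = sysconf(_SC_PAGESIZE);
        off_t  aligned = offset - (offset % pagesz);
        size_t maplen = len + (size_t) (offset - aligned);

        void *p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, aligned);
        if (p == MAP_FAILED)
            return -1;

        /* MS_ASYNC: schedule the writeback, do not wait for it */
        int rc = msync(p, maplen, MS_ASYNC);

        munmap(p, maplen);
        return rc;
    }

If that is the shape of it, the data itself stays in shared buffers and the OS page cache; the mapping is only a handle for the hint, so my cache-management worry above would not apply to the flushing side.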

>> Hmmm. I'll check. I'm still unconvinced that using a tree for what is
>> in most cases a 2-3 element set is an improvement.

> Yes, it'll not matter that much in many cases. But I rather disliked
> the NextBufferToWrite() implementation, especially that it walks the
> array multiple times. And I did see setups with ~15 tablespaces.

ISTM that it is rather an argument for taking the tablespace into account in the sorting, not necessarily for a binary heap.
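That is, sort with the tablespace as the leading key, along these lines (a sketch; struct and field names invented):

    typedef unsigned int Oid;
    typedef unsigned int BlockNumber;

    typedef struct
    {
        Oid         tblspc;    /* tablespace, i.e. the underlying device */
        Oid         rel;       /* relation file within the tablespace    */
        BlockNumber blkno;     /* block within the relation              */
    } BufToWrite;

    static int
    cmp_buf_to_write(const void *pa, const void *pb)
    {
        const BufToWrite *a = pa;
        const BufToWrite *b = pb;

        if (a->tblspc != b->tblspc)
            return a->tblspc < b->tblspc ? -1 : 1;
        if (a->rel != b->rel)
            return a->rel < b->rel ? -1 : 1;
        if (a->blkno != b->blkno)
            return a->blkno < b->blkno ? -1 : 1;
        return 0;
    }

After qsort(to_write, n, sizeof(BufToWrite), cmp_buf_to_write), each tablespace is one contiguous sub-array, and the writer can round-robin between sub-arrays with a plain cursor per tablespace instead of a heap.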

>> I also noted this point, but I'm not sure how to take a better
>> approach, so I left it as it is. I tried 50 ms & 200 ms on some runs,
>> without significant effect on performance for the tests I ran then.
>> The point of having a value that is not too small is that it provides
>> some significant work to the IO subsystem without overwhelming it.

> I don't think that makes much sense. All a longer sleep achieves is
> creating a larger burst of writes afterwards. We should really sleep
> adaptively.

It sounds reasonable, but what would be the criterion?
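Perhaps the progress target itself? That is, when ahead of schedule, sleep only until the next write is due rather than a fixed 100 ms, and when behind, do not sleep at all. A speculative sketch (all names invented):

    #include <stdint.h>
    #include <unistd.h>

    /* stubs for the real checkpoint progress bookkeeping (invented names) */
    extern double  progress_done(void);      /* fraction of buffers written */
    extern double  progress_due_now(void);   /* fraction due at this time   */
    extern int64_t usecs_until_due(double fraction);

    static void
    adaptive_nap(void)
    {
        /* behind schedule: keep writing, no sleep at all */
        if (progress_done() < progress_due_now())
            return;

        /* ahead of schedule: sleep just until the next write is due,
         * capped so external activity is still noticed regularly */
        int64_t us = usecs_until_due(progress_done());
        if (us > 100 * 1000)
            us = 100 * 1000;
        if (us > 0)
            usleep((useconds_t) us);
    }

That would avoid both the fixed-length bursts and sleeping while behind.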

--
Fabien.

