Hello Andres,

Here is a v13, which is just a rebase after 1aba62ec.

I'm working on this patch, to get it into a state I think it'd be
commitable.

I'll review it carefully. Also, if you can include some performance feature it would help, even if I'll do some more runs.

In my performance testing it showed that calling PerformFileFlush() only at segment boundaries and in CheckpointWriteDelay() can lead to rather spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is problematic because it only is triggered while on schedule, and not when behind.

When behind, the PerformFileFlush should be called on segment boundaries.
The idea was not to go to sleep without flushing, and to do it as little as possible.

My testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.

Hmmm. 32 buffers means 256 KB, which is quite small. Not sure what a good "limit" would be. It could depend whether pages are close or not.

I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().

I'm not sure that mmap/msync can be used for this purpose, because there is no real control it seems about where the file is mmapped.

As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic doesn't work correctly if tablespaces aren't actually sorted. I'm actually inclined to fix this by simply removing the flag to enable/disable sorting.

I do no think that there is a significant downside to having sort always on, but showing it requires to be able to test, so to have a guc. The point of the guc is to demonstrate that the feature is harmless:-)

Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
so many places looks ugly, I want to push that to the underlying
functions. If we add a different flushing approach we shouldn't have to
touch several places that don't actually really care.

I agree that it is pretty ugly, but I do not think that you can remove them all. You need at least one for checking the guc and one for enabling the feature. Maybe their number could be reduced if the functions are switched to do-nothing stubs which are called nevertheless, but I was not keen on letting unused code when there is no sync_file_range nor posix_fadvise.

I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
seems to work well, with a bit less code actually.

Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element set in most case is an improvement.

I'll post this after some more cleanup & testing.

I'll have a look when it is ready.

I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
particularly good. In high throughput workloads the 100ms sleep is too
long, leading to bursty IO behaviour. If 1k+ buffers a written out a
second 100ms is a rather long sleep. For another that we only sleep
100ms when the write rate is low makes the checkpoint finish rather
quickly - on a slow disk (say microsd) that can cause unneccesary
slowdowns for concurrent activity.  ISTM we should calculate the sleep
time in a better way.

I also noted this point, but I'm not sure how to have a better approach, so I let it as it is. I tried 50 ms & 200 ms on some runs, without significant effect on performance for the test I ran then. The point of having not too small a value is that it provide some significant work to the IO subsystem without overflowing it. On average it does not matter. I'm unsure how it would interact with flushing. So I decided not to do anything about it. Maybe it should be a guc, but I would not know how to choose it.

The SIGHUP behaviour is also weird. Anyway, this probably belongs on a new thread.

Probably. I did not try to look at that.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to