Hello Andres,

>> So this is an evidence-based decision.

> Meh. You're testing on low concurrency.

Well, I'm just testing on the available box.

I do not see the link between high concurrency and whether moving fsync as early as possible would have a large performance impact. I think it might be interesting if the bgwriter is doing a lot of writes, but I'm not sure under which configuration & load that would happen.

> I think it's a really bad idea to do this in chunks.

>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.

> If the sorting of the dirty blocks alone takes minutes, it'll never
> finish writing that many buffers out. That's an utterly bogus argument.

Well, if in the future you have 8 TB of memory (I saw a 512 GB server just a few weeks ago) and set shared_buffers=2TB, then if I'm not mistaken in the worst case you may have 2 TB / 8 kB = 256 million buffers to write at a checkpoint. Then it really depends on the I/O subsystem attached to the box, but if you bought 8 TB of RAM you would probably have nice I/O hardware as well.

>> Another argument is that Tom said he wanted that :-)

> I don't think he said that when we discussed this last.

That is what I was recalling when I wrote this sentence:

http://www.postgresql.org/message-id/6599.1409421...@sss.pgh.pa.us

But it had more to do with memory-allocation management.

>> In practice the value can be set at a high value so that it is nearly always
>> sorted in one go. Maybe value "0" could be made special and used to trigger
>> this behavior systematically, and be the default.

> You're just making things too complicated.

ISTM that it is not really complicated, but anyway it is easy to change the checkpoint_sort setting to a boolean.

>> In the reported performance tests, there is usually just one chunk anyway,
>> sometimes two, so this gives an idea of the overall performance effect.
>>
>> This is not an issue if the chunks are large enough, and anyway the GUC
>> allows changing the behavior as desired.

> I don't think this is true. If two consecutive blocks are dirty, but you
> sync them in two different chunks, you *always* will cause additional
> random IO.

I think that the additional random I/O could be small if the chunks are large, i.e. the performance benefit of sorting larger and larger chunks is diminishing.

> Either the drive will have to skip the write for that block,
> or the os will prefetch the data. More importantly with SSDs it voids
> the wear leveling advantages.

Possibly. I do not really understand the wear leveling done by SSD firmware.

> [...] often interleaved. That pattern is horrible for SSDs too. We should always
> try to do this at once, and only fall back to using less memory if we
> couldn't allocate everything.

The memory is needed anyway in order to avoid duplicating, or making significantly heavier, the implementation of the throttling loop. It is allocated once, on the first checkpoint. The allocation could be moved to the checkpointer initialization if this is a concern. The memory needed is one int per buffer, which is less than what the 2007 patch used.
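
To give an idea of the shape of it, here is a minimal standalone sketch of the "one int per buffer" approach, with the comparator looking the keys up in the descriptors. The types and names (MockBufferDesc and so on) are mine and only stand in for the real structures; this is not the patch or bufmgr code.

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>

  typedef struct MockBufferDesc      /* stand-in for a shared buffer descriptor */
  {
      uint32_t relation;             /* sort keys: relation, fork, block */
      uint32_t fork;
      uint32_t block;
      int      dirty;                /* must be written by this checkpoint */
  } MockBufferDesc;

  static MockBufferDesc *descriptors;           /* NBuffers descriptors */

  static int
  cmp_buf_id(const void *pa, const void *pb)
  {
      const MockBufferDesc *a = &descriptors[*(const int *) pa];
      const MockBufferDesc *b = &descriptors[*(const int *) pb];

      if (a->relation != b->relation)
          return a->relation < b->relation ? -1 : 1;
      if (a->fork != b->fork)
          return a->fork < b->fork ? -1 : 1;
      if (a->block != b->block)
          return a->block < b->block ? -1 : 1;
      return 0;
  }

  int
  main(void)
  {
      int  NBuffers = 6;                        /* toy value, think 128000+ in practice */
      int *to_write = malloc(NBuffers * sizeof(int));   /* one int per buffer */
      int  n = 0;

      descriptors = calloc(NBuffers, sizeof(MockBufferDesc));
      descriptors[4] = (MockBufferDesc){ 1, 0, 10, 1 }; /* mark two buffers dirty */
      descriptors[2] = (MockBufferDesc){ 1, 0, 11, 1 };

      for (int i = 0; i < NBuffers; i++)        /* collect buf_ids of dirty buffers */
          if (descriptors[i].dirty)
              to_write[n++] = i;

      qsort(to_write, n, sizeof(int), cmp_buf_id);      /* sort buf_ids by file/block */

      for (int i = 0; i < n; i++)
          printf("write buffer %d\n", to_write[i]);

      free(to_write);
      free(descriptors);
      return 0;
  }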

> There's a reason the 2007 patch (and my revision of it last year) did
> what it did. You can't just access buffer descriptors without
> locking.

I really think that you can, because the sorting is only "advisory": the checkpointer will work fine if the sorting is wrong or not done at all, just as it does now when it writes buffers unsorted. The only condition is that buffers must not be moved around while still carrying their "to write in this checkpoint" flag, but this is also necessary for the current checkpointer code to work.

Moreover, this trick is already pre-existing, independently of the patch I submitted: some tests are done without locking, but the actual "buffer write" does the locking and skips the write if the earlier unlocked test turns out to have been wrong, as described in comments in the code.
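
As a self-contained illustration of that "unlocked test, locked recheck" pattern (mock types and invented names, with a pthread mutex standing in for the buffer header lock; not the actual bufmgr code):

  #include <stdio.h>
  #include <stdbool.h>
  #include <pthread.h>

  typedef struct MockBuffer
  {
      pthread_mutex_t header_lock;          /* stand-in for the buffer header lock */
      bool            checkpoint_needed;    /* "to write in this checkpoint" flag */
  } MockBuffer;

  /* Lock-free peek used while collecting/sorting: the answer may be stale,
   * which is acceptable because it is only advisory. */
  static bool
  maybe_needs_checkpoint_write(const MockBuffer *buf)
  {
      return buf->checkpoint_needed;
  }

  /* The actual write takes the lock, rechecks the flag, and simply skips the
   * buffer if the earlier unlocked test turned out to be wrong. */
  static void
  checkpoint_write_one(MockBuffer *buf, int buf_id)
  {
      pthread_mutex_lock(&buf->header_lock);
      if (!buf->checkpoint_needed)
      {
          pthread_mutex_unlock(&buf->header_lock);
          return;                            /* already handled elsewhere: skip */
      }
      buf->checkpoint_needed = false;
      pthread_mutex_unlock(&buf->header_lock);

      printf("flushing buffer %d\n", buf_id); /* stand-in for the real write */
  }

  int
  main(void)
  {
      MockBuffer buf = { PTHREAD_MUTEX_INITIALIZER, true };

      if (maybe_needs_checkpoint_write(&buf))  /* advisory, no lock taken */
          checkpoint_write_one(&buf, 0);       /* locked recheck, then write */
      return 0;
  }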

> Besides, causing additional cacheline bouncing during the
> sorting process is a bad idea.

Hmmm. The impact of avoiding that would be to multiply the memory required by 3 or 4 (buf_id, relation, forknum, offset instead of just buf_id), and I understood that memory was a concern.

Moreover, once the sorting process has fetched the cache lines which contain the sort keys from the buffer descriptors, I think it should be pretty much okay; incidentally, they would probably have been brought into cache already by the scan that collects the dirty buffers. Also, I do not think that the sorting time for 128000 buffers, with possible cache misses, is a big issue, but I do not have measurements to defend that. I could try to collect some data about it.
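
For comparison, a sketch of the alternative, i.e. copying the sort keys out of the descriptors into a dedicated array so that the sort itself never touches shared buffer headers. The field names are mine and only illustrative; the point is the roughly 3-4 ints per buffer instead of 1.

  #include <stdlib.h>
  #include <stdint.h>

  typedef struct SortItem
  {
      uint32_t relation;   /* copied sort keys ...          */
      uint32_t fork;
      uint32_t block;
      int      buf_id;     /* ... plus the buffer to write: */
  } SortItem;              /* about 4 ints instead of 1     */

  static int
  cmp_sort_item(const void *pa, const void *pb)
  {
      const SortItem *a = pa;
      const SortItem *b = pb;

      if (a->relation != b->relation)
          return a->relation < b->relation ? -1 : 1;
      if (a->fork != b->fork)
          return a->fork < b->fork ? -1 : 1;
      if (a->block != b->block)
          return a->block < b->block ? -1 : 1;
      return 0;
  }

  /* The sort then only walks this private array, with good cache locality,
   * instead of chasing buf_ids back into the shared descriptors. */
  void
  sort_items(SortItem *items, size_t n)
  {
      qsort(items, n, sizeof(SortItem), cmp_sort_item);
  }

Whether the better locality during the sort pays for the extra memory is exactly the question; as said above, I do not have numbers for it.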

--
Fabien.

