To fix it, ISTM that it is enough to hold a "do not close" lock on the file
while a flush is in progress (a short time), which would prevent mdclose from
doing its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct?

Basically yes, I'm suggesting a mutex in the vfd struct.

If that, how would you deal with FileClose() being called?

Just wait for the mutex. It would be held while flushes are accumulated into the flush context and released once the flush has been performed and the fd is no longer needed for that purpose. The wait is expected to be short (at worst between the wake & sleep of the checkpointer, and just one file at a time).
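
Roughly, something like this minimal standalone sketch (pthread-based, not
actual PostgreSQL code; all names are hypothetical, and within a single
backend process it could just as well be a simple flag checked by FileClose):

    #include <pthread.h>
    #include <unistd.h>

    typedef struct SketchVfd
    {
        int             fd;          /* kernel file descriptor */
        pthread_mutex_t flush_lock;  /* held while a flush context needs the fd */
    } SketchVfd;

    /* called when writes to this file are accumulated into the flush context */
    static void
    sketch_flush_begin(SketchVfd *vfd)
    {
        pthread_mutex_lock(&vfd->flush_lock);
    }

    /* called once the flush has been issued and the fd is no longer needed */
    static void
    sketch_flush_end(SketchVfd *vfd)
    {
        pthread_mutex_unlock(&vfd->flush_lock);
    }

    /* FileClose() equivalent: wait until no flush context needs the fd */
    static void
    sketch_file_close(SketchVfd *vfd)
    {
        pthread_mutex_lock(&vfd->flush_lock);   /* blocks only briefly */
        close(vfd->fd);
        pthread_mutex_unlock(&vfd->flush_lock);
    }

So FileClose would block at most for the duration of one flush of one file,
which is the short time mentioned above.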

I'm conscious that the patch only addresses *checkpointer* writes, not those
from the bgwriter or from backends. I agree that these will need to be
addressed at some point as well, but given the time it takes to get a patch
through, and that the more complex it is the slower it goes (sorting proposals
are 10 years old), I think this should be postponed for later.

I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over the checkpointer's
design as well.

Hmmm. See below.

What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%.

I did not see such behavior in the many tests I ran. Could you share more
precise details so that I can try to reproduce this performance regression?
(available memory, shared buffers, db size, ...).


I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX (2147483647) to prevent that from
occurring.

I'll show actual numbers at some point yes. I tried three different systems:

* my laptop, 16 GB RAM, 840 EVO 1TB as storage. With 2GB
 shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short. I tend to do tests with much larger timeouts. I would advise against a short timeout, esp. in a high-throughput system: the whole point of the checkpointer is to accumulate as many changes as possible.

I'll look into that.

This explanation seems to suggest that if bgwriter/worker writes are sorted
and/or coordinated with the checkpointer somehow then all would be well?

Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order.  We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.

Maybe the sorting could be shared with others so that everybody uses the same order?

That would suggest having one global sorting of buffers, maybe maintained by the checkpointer, which could be used by all processes that need to scan the buffers (in file order) instead of scanning them in memory order.

For this purpose, I think that the initial index-based sorting would suffice. It could be re-sorted periodically, with some delay maintained in a GUC, or when significant buffer changes have occurred (reads & writes).
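
As a rough illustration of what I have in mind (a standalone sketch only, not
actual PostgreSQL code; the struct and function names are made up), the shared
ordering could just be an array of buffer ids kept sorted in file order and
re-sorted from time to time by the checkpointer:

    #include <stdlib.h>

    typedef struct SketchBufTag
    {
        unsigned    tablespace;
        unsigned    relnode;
        int         forknum;
        unsigned    blocknum;
    } SketchBufTag;

    typedef struct SketchSortEntry
    {
        int          buf_id;   /* index into the buffer pool */
        SketchBufTag tag;      /* snapshot of the buffer's file location */
    } SketchSortEntry;

    /* compare two entries in file order: tablespace, relation, fork, block */
    static int
    sketch_tag_cmp(const void *a, const void *b)
    {
        const SketchSortEntry *x = a;
        const SketchSortEntry *y = b;

        if (x->tag.tablespace != y->tag.tablespace)
            return x->tag.tablespace < y->tag.tablespace ? -1 : 1;
        if (x->tag.relnode != y->tag.relnode)
            return x->tag.relnode < y->tag.relnode ? -1 : 1;
        if (x->tag.forknum != y->tag.forknum)
            return x->tag.forknum < y->tag.forknum ? -1 : 1;
        if (x->tag.blocknum != y->tag.blocknum)
            return x->tag.blocknum < y->tag.blocknum ? -1 : 1;
        return 0;
    }

    /* re-sort the shared ordering; run by the checkpointer, either on a
       GUC-driven delay or after enough buffer reads & writes */
    static void
    sketch_resort(SketchSortEntry *entries, size_t n)
    {
        qsort(entries, n, sizeof(SketchSortEntry), sketch_tag_cmp);
    }

Checkpointer, bgwriter and backends would then all scan the buffers through
this array rather than in memory order, so their writes would tend to be
issued in the same (file) order.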

ISTM that this explanation could be checked by looking at whether
bgwriter/worker writes are especially large compared to checkpointer writes
in those cases with reduced throughput? The data is in the log.

What do you mean by "large"? Numerous?

I mean the number of buffers written by the bgwriter/workers is greater than what is written by the checkpointer. If everything fits in shared buffers, the bgwriter/workers mostly do not need to write anything and the checkpointer does all the writes.

The larger the memory needed, the more likely the workers/bgwriter will have to kick in and generate random I/Os, because nothing sensible is currently done there, so this is consistent with your findings, although I'm surprised that it would have a large effect on throughput, as already said.

Hmmm. The shorter the timeout, the less likely the sorting is to be
effective.

You mean, as evidenced by the results, or is that what you'd actually
expect?

What I would expect...

--
Fabien.

