To fix it, ISTM that it is enough to hold a "do not close" lock on the file
while a flush is in progress (a short time), which would prevent mdclose from
doing its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct?

Basically yes, I'm suggesting a mutex in the vfd struct.

If that, how would you deal with FileClose() being called?

Just wait for the mutex. It would be held while flushes are accumulated into the flush context and released once the flush has been performed and the fd is no longer needed for that purpose. The wait is expected to be short (at worst between the wake & sleep of the checkpointer, and just one file at a time).
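
Roughly, something like this minimal standalone sketch (pthread-based, not
actual PostgreSQL code; all names are hypothetical, and within a single
backend process it could just as well be a simple flag checked by FileClose):

    #include <pthread.h>
    #include <unistd.h>

    typedef struct SketchVfd
    {
        int             fd;          /* kernel file descriptor */
        pthread_mutex_t flush_lock;  /* held while a flush context needs the fd */
    } SketchVfd;

    /* called when writes to this file are accumulated into the flush context */
    static void
    sketch_flush_begin(SketchVfd *vfd)
    {
        pthread_mutex_lock(&vfd->flush_lock);
    }

    /* called once the flush has been issued and the fd is no longer needed */
    static void
    sketch_flush_end(SketchVfd *vfd)
    {
        pthread_mutex_unlock(&vfd->flush_lock);
    }

    /* FileClose() equivalent: wait until no flush context needs the fd */
    static void
    sketch_file_close(SketchVfd *vfd)
    {
        pthread_mutex_lock(&vfd->flush_lock);   /* blocks only briefly */
        close(vfd->fd);
        pthread_mutex_unlock(&vfd->flush_lock);
    }

So FileClose would block at most for the duration of one flush of one file,
which is the short time mentioned above.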

I'm conscious that the patch only addresses *checkpointer* writes, not those
from the bgwriter or from backends. I agree that these will need to be
addressed at some point as well, but given the time it takes to get a patch
through, and that the more complex it is the slower it goes (sorting proposals
are 10 years old), I think this should be postponed for later.

I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over the checkpointer's
design as well.

Hmmm. See below.

What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to
30%.

I did not see such behavior in the many tests I ran. Could you share more
precise details so that I can try to reproduce this performance regression?
(available memory, shared buffers, db size, ...).


I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX (2147483647) to prevent that from
occurring.

I'll show actual numbers at some point yes. I tried three different systems:

* my laptop, 16 GB RAM, 840 EVO 1TB as storage. With 2GB
 shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short. I tend to do tests with much larger timeouts. I would advise against a short timeout, esp. in a high-throughput system: the whole point of the checkpointer is to accumulate as many changes as possible.

I'll look into that.

This explanation seems to suggest that if bgwriter/worker writes are sorted
and/or coordinated with the checkpointer somehow then all would be well?

Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order.  We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.

Maybe the sorting could be shared with others so that everybody uses the same order?

That would suggest having one global sorting of buffers, maybe maintained by the checkpointer, which could be used by all processes that need to scan the buffers (in file order) instead of scanning them in memory order.

For this purpose, I think that the initial index-based sorting would suffice. It could be re-sorted periodically, with some delay maintained in a GUC, or when significant buffer changes have occurred (reads & writes).
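
As a rough illustration of what I have in mind (a standalone sketch only, not
actual PostgreSQL code; the struct and function names are made up), the shared
ordering could just be an array of buffer ids kept sorted in file order and
re-sorted from time to time by the checkpointer:

    #include <stdlib.h>

    typedef struct SketchBufTag
    {
        unsigned    tablespace;
        unsigned    relnode;
        int         forknum;
        unsigned    blocknum;
    } SketchBufTag;

    typedef struct SketchSortEntry
    {
        int          buf_id;   /* index into the buffer pool */
        SketchBufTag tag;      /* snapshot of the buffer's file location */
    } SketchSortEntry;

    /* compare two entries in file order: tablespace, relation, fork, block */
    static int
    sketch_tag_cmp(const void *a, const void *b)
    {
        const SketchSortEntry *x = a;
        const SketchSortEntry *y = b;

        if (x->tag.tablespace != y->tag.tablespace)
            return x->tag.tablespace < y->tag.tablespace ? -1 : 1;
        if (x->tag.relnode != y->tag.relnode)
            return x->tag.relnode < y->tag.relnode ? -1 : 1;
        if (x->tag.forknum != y->tag.forknum)
            return x->tag.forknum < y->tag.forknum ? -1 : 1;
        if (x->tag.blocknum != y->tag.blocknum)
            return x->tag.blocknum < y->tag.blocknum ? -1 : 1;
        return 0;
    }

    /* re-sort the shared ordering; run by the checkpointer, either on a
       GUC-driven delay or after enough buffer reads & writes */
    static void
    sketch_resort(SketchSortEntry *entries, size_t n)
    {
        qsort(entries, n, sizeof(SketchSortEntry), sketch_tag_cmp);
    }

Checkpointer, bgwriter and backends would then all scan the buffers through
this array rather than in memory order, so their writes would tend to be
issued in the same (file) order.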

ISTM that this explanation could be checked by looking at whether
bgwriter/worker writes are especially large compared to checkpointer writes
in those cases with reduced throughput? The data is in the log.

What do you mean by "large"? Numerous?

I mean the number of buffers written by the bgwriter/workers is greater than what is written by the checkpointer. If everything fits in shared buffers, the bgwriter/workers mostly do not need to write anything and the checkpointer does all the writes.

The larger the memory needed, the more likely the workers/bgwriter will have to kick in and generate random I/Os, because nothing sensible is currently done there, so this is consistent with your findings, although I'm surprised that it would have a large effect on throughput, as already said.

Hmmm. The shorter the timeout, the less likely the sorting is to be
effective.

You mean, as evidenced by the results, or is that what you'd actually
expect?

What I would expect...

--
Fabien.

