To fix it, ISTM that it is enough to hold a "do not close" lock on the file
while a flush is in progress (a short time), which would prevent mdclose from
closing the underlying fd.
Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct?
Basically yes, I'm suggesting a mutex in the vfd struct.
If that, how would you deal with FileClose() being called?
Just wait for the mutex, which would be held while flushes are accumulated
into the flush context and released once the flush is performed and the
fd is no longer needed for this purpose. That window is expected to be
short (at worst between the wake & sleep of the checkpointer, and just one
file at a time).
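To illustrate, here is a rough sketch of what I have in mind. The struct is
simplified and the names (FileFlushBegin, FileFlushEnd, flush_in_progress)
are made up, this is not the actual fd.c code; it also uses a plain flag plus
a deferred close rather than a real mutex, just to show the intended lifecycle:

#include <stdbool.h>
#include <unistd.h>

typedef struct vfd
{
    int   fd;                  /* kernel file descriptor, -1 if closed */
    bool  flush_in_progress;   /* fd currently referenced by a flush context */
    bool  close_deferred;      /* FileClose() arrived during a flush */
    /* ... other fields (fileName, fdstate, LRU links, ...) ... */
} Vfd;

/* called when the checkpointer puts this file's fd into the flush context */
void
FileFlushBegin(Vfd *vfdP)
{
    vfdP->flush_in_progress = true;
}

/* called once the flush has been issued and the fd is no longer needed */
void
FileFlushEnd(Vfd *vfdP)
{
    vfdP->flush_in_progress = false;
    if (vfdP->close_deferred)
    {
        close(vfdP->fd);       /* now safe: the flush no longer needs the fd */
        vfdP->fd = -1;
        vfdP->close_deferred = false;
    }
}

/* simplified FileClose(): postpone the real close while a flush holds the fd */
void
FileClose(Vfd *vfdP)
{
    if (vfdP->flush_in_progress)
    {
        vfdP->close_deferred = true;
        return;
    }
    close(vfdP->fd);
    vfdP->fd = -1;
}

The real thing would have to plug into FileClose()/mdclose() and the flush
context, but the lifecycle would be the same: mark the vfd when its fd enters
the flush context, unmark it and perform any postponed close once the flush
has been issued.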
I'm conscious that the patch only addresses *checkpointer* writes, not those
from the bgwriter or backends. I agree that these should be addressed at some
point as well, but given the time it takes to get a patch through, and that
the more complex it is the slower it goes (the sorting propositions are 10
years old), I think this should be postponed for later.
I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over checkpointer's
design as well.
Hmmm. See below.
What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput. Up to 30%.
I did not see such behavior in the many tests I ran. Could you share more
precise details so that I can try to reproduce this performance regression?
(available memory, shared buffers, db size, ...).
I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX (2147483647) to prevent that from
occurring.
I'll show actual numbers at some point yes. I tried three different systems:
* my laptop, 16 GB RAM, 840 EVO 1TB as storage. With 2GB
shared_buffers. Tried checkpoint timeouts from 60 to 300s.
Hmmm. This is quite short. I tend to do tests with much larger timeouts. I
would advise against a short timeout, esp. in a high-throughput system: the
whole point of the checkpointer is to accumulate as many changes as
possible.
I'll look into that.
This explanation seems to suggest that if bgwriter/worker writes were sorted
and/or coordinated with the checkpointer somehow, then all would be well?
Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order. We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.
Maybe the sorting could be shared with others so that everybody uses the
same order?
That would suggest having one global sorting of buffers, maybe maintained
by the checkpointer, which could be used by all processes that need to
scan the buffers (in file order), instead of scanning them in memory
order.
For this purpose, I think that the initial index-based sorting would
suffice. It could be re-sorted periodically, with the delay controlled by a
GUC, or when significant buffer changes (reads & writes) have occurred.
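Roughly, as a purely illustrative sketch (the names and structures below are
made up, nothing of this exists in the tree): a shared array of buffer ids
kept in file order, which the checkpointer would re-sort from time to time
and which other processes could scan instead of walking shared buffers in
memory order.

#include <stdlib.h>

typedef struct BufferSortEntry
{
    unsigned int tablespace;   /* sort keys: tablespace, relation, fork, block */
    unsigned int relnode;
    unsigned int forknum;
    unsigned int blocknum;
    int          buf_id;       /* index into shared buffers */
} BufferSortEntry;

BufferSortEntry *shared_sort_index;   /* would live in shared memory */
int              shared_sort_count;

static int
buffer_sort_cmp(const void *a, const void *b)
{
    const BufferSortEntry *x = (const BufferSortEntry *) a;
    const BufferSortEntry *y = (const BufferSortEntry *) b;

    if (x->tablespace != y->tablespace)
        return (x->tablespace < y->tablespace) ? -1 : 1;
    if (x->relnode != y->relnode)
        return (x->relnode < y->relnode) ? -1 : 1;
    if (x->forknum != y->forknum)
        return (x->forknum < y->forknum) ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return (x->blocknum < y->blocknum) ? -1 : 1;
    return 0;
}

/*
 * The checkpointer would call this periodically (delay controlled by a GUC)
 * or after enough buffer turnover; other processes just scan the array.
 */
void
resort_shared_index(void)
{
    qsort(shared_sort_index, shared_sort_count,
          sizeof(BufferSortEntry), buffer_sort_cmp);
}

A plain qsort over the whole array is obviously the simplest option; whether
re-sorting is cheap enough, and how stale the order may be allowed to get,
would have to be measured.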
ISTM that this explanation could be checked by looking at whether
bgwriter/worker writes are especially large compared to checkpointer writes
in those cases with reduced throughput? The data is in the logs.
What do you mean with "large"? Numerous?
I mean that the number of buffers written by the bgwriter/workers is greater
than what is written by the checkpointer. If everything fits in shared
buffers, the bgwriter/workers mostly do not need to write anything and the
checkpointer does all the writes.
The larger the memory needed, the more likely workers/bgwriter will have
to kick in and generate random I/Os because nothing sensible is currently
done there, so this is consistent with your findings, although I'm surprised
that it would have such a large effect on throughput, as already said.
Hmmm. The shorter the timeout, the more likely it is that the sorting will
NOT be effective.
You mean, as evidenced by the results, or is that what you'd actually
expect?
What I would expect...
--
Fabien.