Hello Andres,
I thought of adding a pointer to the current flush structure at the vfd
level, so that on closing a file with a flush in progress the flush can be
completed and the structure properly cleaned up; the checkpointer would
then see a clean state later and be able to skip it, instead of generating
flushes on a closed file or on a different file...
Maybe I'm missing something, but that is the plan I had in mind.
That might work, although it'd not be pretty (not fatally so
though).
Alas, any solution has to communicate somehow between the API levels, so
it cannot be "pretty", although we should avoid the worst.
But I'm inclined to go a different way: I think it's a mistake to do
flushing based on a single file. It seems better to track a fixed number
of outstanding 'block flushes', independent of the file. Whenever the
number of outstanding blocks is exceeded, sort that list, and flush all
outstanding flush requests after merging neighbouring flushes.
Hmmm. I'm not sure I understand your strategy.
I do not think that flushing without a prior sorting would be effective,
because there is no clear reason why buffers written together would then
be next to each other and thus give sequential write benefits; we would
just get flushed random IO. I tested that and it worked badly.
One of the points of aggregating flushes is that the range flush call cost
is significant, as shown by preliminary tests I did, probably up in the
thread, so it makes sense to limit this cost, hence the aggregation. This
removed some performance regressions I saw in some cases.
Also, the granularity of the buffer flush call is a file + offset + size,
so it necessarily has to be done this way (i.e. per file).
Once buffers are sorted per file and by offset within each file, written
buffers are as close as possible to one another, the merging is very easy
to compute (it is done on the fly, with no need to keep a list of buffers,
for instance), and it is optimally effective. When the checkpointed file
changes we will never go back to it before the next checkpoint, so there
is no reason not to flush right then.
So basically I do not see a clear advantage to your suggestion, especially
when taking the checkpointer's scheduling into consideration:
In effect the checkpointer already works in little bursts of activity
between sleep phases, writing buffers a few at a time, so it may already
behave more or less as you expect, but not for the same reason.
The closest strategy I experimented with, which is maybe close to your
suggestion, was to enforce a minimum number of buffers to write when
awoken and to vary the sleep delay in between, but I had no clear way to
choose values, and the experiments I did showed no significant performance
impact from varying these parameters, so I left that out. If you find a
magic number of buffers that yields consistently better performance, fine
with me, but that is independent of whether aggregation happens before or
after.
Imo that means that we'd better track writes on a relfilenode + block
number level.
I do not think that it is a better option. Moreover, the current approach
has proven very effective over hundreds of runs, so redoing it differently
for its own sake does not look like good resource allocation.
--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers