Hello Andres,

>> I thought of adding a pointer to the current flush structure at the
>> vfd level, so that on closing a file with a flush in progress the
>> flush can be done and the structure properly cleaned up; later the
>> checkpointer would then see a clean state and be able to skip it,
>> instead of generating flushes on a closed file or on a different
>> file...
>>
>> Maybe I'm missing something, but that is the plan I had in mind.

> That might work, although it'd not be pretty (not fatally so though).

Alas, any solution has to communicate somehow between the API levels, so
it cannot be "pretty", although we should avoid the worst.

> But I'm inclined to go a different way: I think it's a mistake to do
> flushing based on a single file. It seems better to track a fixed
> number of outstanding 'block flushes', independent of the file.
> Whenever the number of outstanding blocks is exceeded, sort that list,
> and flush all outstanding flush requests after merging neighbouring
> flushes.

Hmmm. I'm not sure I understand your strategy.

I do not think that flushing without prior sorting would be effective:
there is no clear reason why buffers written together would end up next
to each other and thus give sequential write benefits, so we would just
get flushed random IO. I tested that and it worked badly.
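
To illustrate what I mean by sorting first (just a sketch, not the
patch's code; the struct and field names are invented for the example):
ordering the pending writes by file and then by block offset is what
turns them into a few sequential ranges instead of random IO.

    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative only: a hypothetical pending buffer write. */
    typedef struct PendingWrite
    {
        int         file_id;    /* which file the buffer belongs to */
        uint32_t    blockno;    /* block offset within that file */
    } PendingWrite;

    /* Sort by file first, then by block, so that contiguous blocks of
     * the same file end up adjacent and can be flushed as sequential
     * ranges. */
    static int
    pending_write_cmp(const void *a, const void *b)
    {
        const PendingWrite *pa = (const PendingWrite *) a;
        const PendingWrite *pb = (const PendingWrite *) b;

        if (pa->file_id != pb->file_id)
            return (pa->file_id < pb->file_id) ? -1 : 1;
        if (pa->blockno != pb->blockno)
            return (pa->blockno < pb->blockno) ? -1 : 1;
        return 0;
    }

    /* usage: qsort(writes, n, sizeof(PendingWrite), pending_write_cmp); */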

One of the points of aggregating flushes is that the cost of a range
flush call is significant, as shown by preliminary tests I did (probably
up in the thread), so it makes sense to limit how many of these calls
are issued, hence the aggregation. This removed some performance
regressions I had in some cases.

Also, the granularity of the buffer flush call is a file plus an offset
and a size, so it necessarily has to be done this way (i.e. per file).
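
For reference, the kind of call involved looks like this (just a sketch:
on Linux it is sync_file_range(), other platforms would need something
else such as posix_fadvise() with POSIX_FADV_DONTNEED, and the wrapper
name is mine):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    /* Ask the kernel to start writeback of [offset, offset + nbytes)
     * of the given file, without waiting for the IO to complete. */
    static void
    flush_range(int fd, off_t offset, off_t nbytes)
    {
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }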

Once buffers are sorted per file and by offset within each file, the
buffers written one after the other are as close as possible, the
merging is very easy to compute (it is done on the fly, there is no need
to keep a list of buffers), and it is optimally effective. Moreover,
when the checkpointed file changes we never go back to the previous one
before the next checkpoint, so there is no reason not to flush it right
then.
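
Something like this (again just a sketch, reusing the hypothetical
PendingWrite and flush_range from above; the 8kB block size and the
fd_for_file() helper are also invented for the example):

    #define BLOCK_SIZE 8192     /* assumed block size for the example */

    /* Walk the sorted pending writes and coalesce neighbouring blocks
     * of the same file into a single range flush; no per-buffer list
     * needs to be kept. */
    static void
    flush_sorted_writes(PendingWrite *writes, size_t n,
                        int (*fd_for_file)(int file_id))
    {
        size_t      i = 0;

        while (i < n)
        {
            int         file = writes[i].file_id;
            uint32_t    first = writes[i].blockno;
            uint32_t    last = first;

            /* extend the range while blocks are contiguous in this file */
            while (i + 1 < n &&
                   writes[i + 1].file_id == file &&
                   writes[i + 1].blockno == last + 1)
                last = writes[++i].blockno;

            flush_range(fd_for_file(file),
                        (off_t) first * BLOCK_SIZE,
                        (off_t) (last - first + 1) * BLOCK_SIZE);
            i++;
        }
    }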

So basically I do not see a clear advantage to your suggestion,
especially when taking into consideration how the checkpointer schedules
its writes:

In effect, the checkpointer already works in little bursts of activity
between sleep phases, writing buffers a few at a time, so it may already
behave more or less as you expect, although not for the same reason.
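
Roughly this kind of pacing, if that helps (a simplified, self-contained
sketch, not the actual checkpointer code; the schedule test, the 100ms
nap and all the names are my own approximations):

    #include <stdbool.h>
    #include <unistd.h>

    /* ahead if the fraction of buffers written exceeds the fraction of
     * the checkpoint target time already spent */
    static bool
    ahead_of_schedule(int written, int total, double elapsed, double target)
    {
        return (double) written / total > elapsed / target;
    }

    /* Write buffers a few at a time: after each write, nap as long as
     * we are ahead of schedule, so writes come in small bursts between
     * sleep phases. */
    static void
    checkpoint_write_loop(int total_buffers, double target_seconds)
    {
        double      elapsed = 0.0;

        for (int written = 0; written < total_buffers; written++)
        {
            /* ... write (and possibly flush) buffer 'written' here ... */

            while (ahead_of_schedule(written + 1, total_buffers,
                                     elapsed, target_seconds))
            {
                usleep(100 * 1000);     /* nap 100ms */
                elapsed += 0.1;
            }
        }
    }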

The closest strategy I experimented with, which is maybe close to your
suggestion, was to enforce a minimum number of buffers to write each
time the checkpointer wakes up and to vary the sleep delay in between,
but I had no clear way to choose the values, and the experiments I did
showed no significant performance impact from varying these parameters,
so I kept that out. If you find a magic number of buffers which results
in consistently better performance, fine with me, but this is
independent of aggregating before or after.

> Imo that means that we'd better track writes on a relfilenode + block
> number level.

I do not think that it is a better option. Moreover, the current
approach has proven very effective over hundreds of runs, so redoing it
differently just for the sake of it does not look like a good use of
resources.

--
Fabien.

