Hello Andres,

One of the points of aggregating flushes is that the cost of a range flush call
is significant, as shown by preliminary tests I did, probably up in the
thread, so it makes sense to limit this cost, hence the aggregation. This
removed some performance regressions I had in some cases.

FWIW, my tests show that flushing for clean ranges is pretty cheap.

Yes, I agree that it is quite cheap, but without aggregating I had tps regressions of a few percent in some cases, and aggregating was enough to avoid these small regressions.
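
To illustrate what I mean by aggregation, here is a minimal sketch (not the
patch itself; names like PendingFlush and record_write are illustrative, and
it assumes Linux's sync_file_range): adjacent writes to the same file are
coalesced into one range so that a single flush call covers them, instead of
one call per 8kB buffer.

    /* Sketch: coalesce contiguous writes to one file into a single
     * range flush, instead of one call per written buffer. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    typedef struct PendingFlush
    {
        int   fd;       /* file being written */
        off_t offset;   /* start of the pending range */
        off_t nbytes;   /* accumulated length */
    } PendingFlush;

    static void
    flush_pending(PendingFlush *pf)
    {
        if (pf->nbytes > 0)
            (void) sync_file_range(pf->fd, pf->offset, pf->nbytes,
                                   SYNC_FILE_RANGE_WRITE);
        pf->nbytes = 0;
    }

    /* called after each buffer write of len bytes at off in fd */
    static void
    record_write(PendingFlush *pf, int fd, off_t off, off_t len)
    {
        if (pf->nbytes > 0 &&
            (fd != pf->fd || off != pf->offset + pf->nbytes))
            flush_pending(pf);      /* not contiguous: flush what we have */
        if (pf->nbytes == 0)
        {
            pf->fd = fd;
            pf->offset = off;
        }
        pf->nbytes += len;
    }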

Also, the granularity of the buffer flush call is a file + offset + size, so
it necessarily has to be done this way (i.e. per file).

What syscalls we issue, and at what level we track outstanding flushes,
doesn't have to be the same.

Sure. But the current version is simple, efficient, and proven over many runs, so there should be a very strong argument, showing a significant benefit, to justify changing the approach, and I see no such thing in your arguments.

For me the current approach is optimal for the checkpointer, because it takes advantage of all the available information to do a better job.

Once buffers are sorted by file and offset within file, the written
buffers are as close to one another as possible, the merging is very
easy to compute (it is done on the fly, there is no need to keep a list of
buffers, for instance), it is optimally effective, and when the checkpointed
file changes we will never go back to it before the next checkpoint, so
there is no reason not to flush right then.
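
As a rough sketch of that write loop (SortedBuf, write_buffer and the
surrounding variables are made-up names, reusing the coalescing helpers
sketched above), the sorted order is what makes the on-the-fly merging
trivial:

    /* Sketch: write sorted checkpoint buffers; because they are ordered
     * by file and offset, the pending range only ever grows forward, and
     * a file is flushed once, when we move past it. */
    for (int i = 0; i < num_to_write; i++)
    {
        SortedBuf *buf = &sorted[i];

        write_buffer(buf);      /* the usual write of one 8kB block */

        /* record_write() flushes the previous range as soon as the file
         * (or contiguity) changes, and since buffers are sorted we never
         * come back to a file once it is passed. */
        record_write(&pending, buf->fd, buf->offset, BLCKSZ);
    }
    flush_pending(&pending);    /* flush whatever range is left */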

Well, that's true if there's only one tablespace, but e.g. not the case
with two tablespaces of about the same number of dirty buffers.

ISTM that in the version of the patch I sent there was one flushing structure per tablespace, each doing its own flushing on its own files, so it should work the same, only the writing intensity is divided by the number of tablespaces? Or am I missing something?
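
Something along these lines (a sketch of the idea with made-up field names,
not the actual structures of the patch), with the checkpointer writing from
each tablespace in turn:

    /* Sketch: one write/flush context per tablespace; the checkpointer
     * round-robins over them, so each tablespace gets about 1/Nth of the
     * write intensity while flushing stays per-file within each context. */
    typedef struct TablespaceWriteState
    {
        Oid          tablespace;   /* which tablespace this context covers */
        int          next;         /* next sorted buffer of this tablespace */
        int          count;        /* how many buffers it has to write */
        PendingFlush pending;      /* per-file range being accumulated */
    } TablespaceWriteState;

    TablespaceWriteState *spaces;  /* one entry per tablespace with dirty buffers */
    int                   nspaces;

    /* one scheduling round: write one buffer from each tablespace */
    for (int s = 0; s < nspaces; s++)
    {
        TablespaceWriteState *ts = &spaces[s];

        if (ts->next < ts->count)
            write_next_buffer_of(ts);   /* hypothetical helper, as above */
    }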

So basically I do not see a clear positive advantage to your suggestion,
especially when taking the scheduling process into consideration:

I don't think it makes a big difference for the checkpointer alone, but
it makes the interface much more suitable for other processes, e.g. the
bgwriter, and normal backends.

Hmmm.

ISTM that the requirements are not exactly the same for the bgwriter and backends vs the checkpointer. The checkpointer has the advantage of being able to plan its IOs over the long term (volume & time are known...), and the implementation takes full benefit of this planning by sorting, scheduling, and flushing buffers so as to generate as many sequential writes as possible.
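
The "planning" part is essentially a sort; a minimal sketch of the kind of
comparator involved (the struct and field names are illustrative, not the
patch's actual ones):

    /* Sketch: order dirty buffers so that, within each file, blocks come
     * out in increasing order, which is what makes sequential writes and
     * easy range merging possible. */
    typedef struct DirtyBufId
    {
        Oid         tablespace;
        Oid         relnode;
        int         forknum;
        BlockNumber blocknum;
    } DirtyBufId;

    static int
    cmp_dirty_buf(const void *a, const void *b)
    {
        const DirtyBufId *x = a, *y = b;

        if (x->tablespace != y->tablespace)
            return x->tablespace < y->tablespace ? -1 : 1;
        if (x->relnode != y->relnode)
            return x->relnode < y->relnode ? -1 : 1;
        if (x->forknum != y->forknum)
            return x->forknum < y->forknum ? -1 : 1;
        if (x->blocknum != y->blocknum)
            return x->blocknum < y->blocknum ? -1 : 1;
        return 0;
    }

    /* qsort(dirty, ndirty, sizeof(DirtyBufId), cmp_dirty_buf); */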

The bgwriter and backends have a much shorter vision (a few seconds, or just the one query being processed), so the solution will be less efficient and probably messier on the coding side. This is life. I do not see why not to take the benefit of full planning in the checkpointer just because the other processes cannot do the same, especially as under many loads the checkpointer does most of the writing and so is the limiting factor.

So I do not buy your suggestion for the checkpointer. Maybe it is the way to go for the bgwriter and backends; if so, fine for them.

Imo that means that we'd better track writes on a relfilenode + block
number level.
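
If I read that right, the suggestion is an interface where the flush tracking
itself is keyed on relation blocks, something like this (purely illustrative,
nobody has written this; FlushTrackWrite is a hypothetical name), with the
translation to file + offset + size only happening when the flush is issued:

    /* Sketch of the suggested tracking level: outstanding flush requests
     * remembered as (relfilenode, fork, block) ranges. */
    typedef struct PendingRelFlush
    {
        RelFileNode rnode;        /* relation being written */
        ForkNumber  forknum;
        BlockNumber first_block;  /* start of the outstanding range */
        BlockNumber nblocks;      /* how many consecutive blocks */
    } PendingRelFlush;

    /* any backend could then report a write with something like: */
    void FlushTrackWrite(RelFileNode rnode, ForkNumber forknum,
                         BlockNumber blkno);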

I do not think that it is a better option. Moreover, the current approach
has been proven to be very effective on hundreds of runs, so redoing it
differently for the sake of it does not look like good resource allocation.

For a subset of workloads, yes.

Hmmm. What I understood is that the workloads that show some performance regressions (regressions that I have *not* seen in the many tests I ran) are not due to checkpointer IOs, but rather occur in settings where most of the writing is done by backends or the bgwriter.

I do not see the point of rewriting the checkpointer for them, although obviously I agree that something also has to be done for the other processes.

Maybe if all the writes (bgwriter and checkpointer) were performed by the same process, then some dynamic mixing, sorting, and aggregating would make sense, but this is currently not the case, and it would probably have a quite limited effect.

Basically I do not understand how changing the flushing organisation as you suggest would improve checkpointer performance significantly; to me it should only degrade performance compared to the current version, as far as the checkpointer is concerned.

--
Fabien.

