Hello Andres,

Hmmm. What I understood is that the workloads that show some performance
regressions (regressions that I have *not* seen in the many tests I ran) are
not due to checkpointer IOs, but rather occur in settings where most of the
writes are done by backends or the bgwriter.

> As far as I can see you've not run many tests where the hot/warm data
> set is larger than memory (the full machine's memory, not
> shared_buffers).

Indeed: I think I ran some, but not many, with such characteristics.

> That quite drastically alters the performance characteristics here, because you suddenly have lots of synchronous read IO thrown into the mix.

If I understand this point correctly...

I would expect the overall performance to be abysmal in such a situation, because you get only intermixed *random* reads and writes: as you point out, synchronous *random* reads (very slow), and on the checkpointer side the writes are mostly random as well, because there is not much to aggregate into sequential writes.

Now why would the patch degrade performance significantly there? To me it should just render the sorting/flushing less and less effective, so performance should fall back to the previous (unpatched) level...

Or maybe it is only the flushing itself which degrades performance, as you point out, because then you have some synchronous (synced) writes as well as reads, as opposed to just the reads without the patch.

If this is indeed the issue, then the way to avoid the regression is *not* to flush, so that the OS IO scheduler is less constrained and can be slightly more effective (well, we are talking about abysmal random-IO disk performance here, so "effective" would mean somewhere between slightly more and slightly less very, very, very bad).

Maybe a trick could be not to aggregate and flush when buffers in the same file are too far apart anyway, based on some threshold? This could be implemented locally, when deciding whether to merge buffer flushes and whether to flush at all, so it would fit into the current code quite simply. A rough sketch of the idea follows below.
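Something along these lines, just as an illustration of the decision rule; the struct, function and threshold names are made up and this is not the actual patch code (it also assumes Linux's sync_file_range is available):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Hypothetical sketch: keep one pending flush range per file, extend it
 * only when the next dirty block is close enough, otherwise advise the
 * kernel to start writing back the pending range and restart from the
 * new block.  Assumes blocks arrive in sorted (increasing) order.
 */
typedef struct PendingFlush
{
    int         fd;
    off_t       start;          /* first byte of the pending range */
    off_t       end;            /* one past the last byte */
} PendingFlush;

#define FLUSH_GAP_THRESHOLD     (128 * 8192)    /* e.g. 128 blocks */

static void
record_dirty_block(PendingFlush *p, int fd, off_t offset, off_t len)
{
    if (p->fd == fd && offset >= p->end &&
        offset - p->end <= FLUSH_GAP_THRESHOLD)
    {
        /* close enough: merge into the pending range */
        p->end = offset + len;
        return;
    }

    /* too far apart, or another file: flush what we have and start over */
    if (p->end > p->start)
        (void) sync_file_range(p->fd, p->start, p->end - p->start,
                               SYNC_FILE_RANGE_WRITE);
    p->fd = fd;
    p->start = offset;
    p->end = offset + len;
}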

Now, my understanding of the sync_file_range call is that it is advice to flush the data, but that it is still asynchronous in nature, so whether it impacts performance that badly depends on the OS IO scheduler. Also, I would like to check whether, under the "regressed performance" (in tps terms) that you observed, pg is more or less responsive. It could be that the average performance is better without flushing, but that pg is offline longer on fsync; in that case I would consider it better to have a lower tps *if* pg's responsiveness is significantly improved.
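To illustrate what I mean by advisory/asynchronous (a minimal example under the assumption of Linux, not the patch itself): with only SYNC_FILE_RANGE_WRITE the call just initiates writeback of the dirty pages in the range and returns; adding the wait flags would turn it into a blocking call.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>

/* Hint the kernel to write back [offset, offset+nbytes); only block for
 * completion of the IO when "wait" is requested. */
static int
hint_writeback(int fd, off_t offset, off_t nbytes, bool wait)
{
    unsigned int flags = SYNC_FILE_RANGE_WRITE;

    if (wait)
        flags |= SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER;
    return sync_file_range(fd, offset, nbytes, flags);
}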

Would you have such measurements for the regression runs you observed?
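For instance, assuming the runs are pgbench-based (client count and duration below are just placeholders), per-second progress reports would show whether tps collapses to about zero around checkpoint fsyncs even when the average looks fine:

    pgbench -c 8 -j 8 -T 300 -P 1 bench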

> Whether it's bgwriter or not I've not fully been able to establish, but
> it's a working theory.

Ok, that is something to check, so as to confirm or refute it.

Given the above discussion, I think my suggestion may be wrong: since the tps is low because of random read/write accesses, not many buffers are modified (so the bgwriter/backends won't need to make space), and the checkpointer does not have much to write (good), *but* all of it is random (bad).

>> I do not see the point of rewriting the checkpointer for them, although
>> obviously I agree that something has to be done also for the other
>> processes.

> Rewriting the checkpointer and fixing the flush interface in a more
> generic way aren't the same thing at all.

Hmmm, probably I misunderstood something in the discussion. It started with an implementation strategy, but it drifted into discussing a performance regression. I agree that these are two different subjects.

--
Fabien.

