Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

Heikki Linnakangas Sun, 16 Jun 2013 07:29:01 -0700

On 10.06.2013 13:51, KONDO Mitsumasa wrote:

I have created a patch that improves the checkpoint IO scheduler for
stable transaction responses.


* Problems in the checkpoint IO schedule under heavy transaction load
Under heavy transaction load, I think the PostgreSQL checkpoint
scheduler has two problems, one at the start and one at the end of a
checkpoint. The first problem is heavy IO at the start of each
checkpoint. It is caused by full-page writes, which generate extra WAL
IO the first time each page is written after the checkpoint starts.
As a result, at the start of a checkpoint the WAL-based checkpoint
scheduler wrongly judges that the checkpoint is behind schedule because
of the full-page writes, even though it is not. This causes poor
transaction response. I think the WAL-based checkpoint scheduler does
not behave properly at the start of a checkpoint.

Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images. That's an interesting
phenomenon, but did you actually see that causing a problem in your
tests? I couldn't tell from the results you posted what the impact of
that was. Could you repeat the tests separately with the two separate
patches you posted later in this thread?
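
To make the effect concrete, here is a minimal sketch of a WAL- and time-based schedule check. It is a simplified model written for illustration, not the server code: the function name, its arguments, and the numbers in the example are all assumptions. It only shows how a burst of full page images can make a checkpoint look late even though little time has passed.

```python
def is_on_schedule(buffers_written_frac, wal_elapsed_frac, time_elapsed_frac,
                   completion_target=0.5):
    """Return True if the checkpoint writer may sleep, False if it must hurry.

    All fractions are in the range 0.0 to 1.0 (hypothetical model):
      buffers_written_frac: share of the checkpoint's dirty buffers written
      wal_elapsed_frac:     WAL consumed since checkpoint start, as a share
                            of checkpoint_segments
      time_elapsed_frac:    time since checkpoint start, as a share of
                            checkpoint_timeout
    """
    # Scale progress so the writes finish at completion_target of the interval.
    progress = buffers_written_frac * completion_target
    if progress < wal_elapsed_frac:
        return False  # WAL is being consumed faster than we are writing
    if progress < time_elapsed_frac:
        return False  # wall-clock time is running ahead of our writes
    return True


# Same write progress and elapsed time; only the WAL volume differs.
print(is_on_schedule(0.10, 0.03, 0.03))  # modest WAL volume
print(is_on_schedule(0.10, 0.12, 0.03))  # WAL inflated by full page images
```

In the second call the only change is the WAL fraction, which is enough to flip the verdict to "behind schedule" and stop the writer from throttling.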

Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.

Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smoothen WAL write I/O, but it would be interesting to try.

The second problem is an fsync freeze at the end of a checkpoint.
Normally, checkpoint writes are flushed in the background by the OS's
IO scheduler. But when that does not work well, the fsync at the end of
the checkpoint causes an IO freeze and slows down transactions.
Unexpectedly slow transactions cause monitoring errors in HA clusters
and degrade the user experience of application services. This is an
especially serious problem for database systems on cloud and virtual
servers, which do not have much IO performance. However, postgresql.conf
offers few parameters that address this. We prefer fast transaction
response to a short checkpoint time. In fact, checkpoint time is short,
and making it a little longer is not a problem. You may think that
checkpoint_segments and checkpoint_timeout could be set to larger
values; however, a large checkpoint_segments fills the file cache with
data that is never read, which is wasteful, and a large
checkpoint_timeout causes long crash recovery.

A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/[email protected].
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.

Apart from the direct performance impact of that patch, sorting the
writes would allow us to interleave the fsyncs with the writes. You
would write out all buffers for relation A, then fsync it, then all
buffers for relation B, then fsync it, and so forth. That would
naturally spread out the fsyncs.
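
A rough sketch of that interleaving, as a toy model rather than the patch itself: dirty buffers are represented as hypothetical (relation, block_no) tuples, and `write` and `fsync` are caller-supplied callbacks standing in for the real I/O calls. Note that `sorted()` builds the full array of dirty buffers, which is exactly the allocation concern mentioned above.

```python
from itertools import groupby
from operator import itemgetter


def checkpoint_sorted(dirty_buffers, write, fsync):
    """Write dirty buffers grouped by relation, fsyncing after each group.

    dirty_buffers: iterable of (relation, block_no) tuples (hypothetical model)
    write(relation, block_no), fsync(relation): callbacks supplied by the caller
    """
    # Sorting needs an array holding every dirty buffer at once.
    ordered = sorted(dirty_buffers)
    for relation, group in groupby(ordered, key=itemgetter(0)):
        for _, block_no in group:
            write(relation, block_no)
        # One fsync per relation, issued as soon as its writes are done,
        # instead of a burst of fsyncs at the very end of the checkpoint.
        fsync(relation)


log = []
checkpoint_sorted(
    [("B", 2), ("A", 7), ("B", 1), ("A", 3)],
    write=lambda rel, blk: log.append(("write", rel, blk)),
    fsync=lambda rel: log.append(("fsync", rel)),
)
for event in log:
    print(event)
```

Sorting by (relation, block_no) also orders the writes within each relation, which is where the direct performance effect of sorted writes would come from.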

If we don't mind scanning the buffer cache several times, we don't
necessarily even need to sort the writes for that. Just scan the buffer
cache for all buffers belonging to relation A, then fsync it. Then scan
the buffer cache again, for all buffers belonging to relation B, then
fsync that, and so forth.
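
The repeated-scan variant could look roughly like this. Again this is only a sketch under assumptions: the buffer cache is modelled as a list of dicts with hypothetical keys, `write` and `fsync` are caller-supplied callbacks, and buffers dirtied while the checkpoint is running are ignored for simplicity.

```python
def checkpoint_by_rescan(buffer_cache, write, fsync):
    """Write and fsync one relation at a time by rescanning the buffer cache.

    buffer_cache: list of dicts with 'relation', 'block_no' and 'dirty' keys
                  (hypothetical model of shared buffers)
    write(relation, block_no), fsync(relation): callbacks supplied by the caller
    """
    # First scan: remember only which relations have dirty buffers.
    # This list is small compared to an array of every dirty buffer.
    relations = []
    for buf in buffer_cache:
        if buf["dirty"] and buf["relation"] not in relations:
            relations.append(buf["relation"])

    # One further scan per relation: write its dirty buffers, then fsync it.
    for rel in relations:
        for buf in buffer_cache:
            if buf["dirty"] and buf["relation"] == rel:
                write(rel, buf["block_no"])
                buf["dirty"] = False
        fsync(rel)


cache = [
    {"relation": "A", "block_no": 3, "dirty": True},
    {"relation": "B", "block_no": 1, "dirty": True},
    {"relation": "A", "block_no": 7, "dirty": True},
    {"relation": "C", "block_no": 5, "dirty": False},
]
log = []
checkpoint_by_rescan(
    cache,
    write=lambda rel, blk: log.append(("write", rel, blk)),
    fsync=lambda rel: log.append(("fsync", rel)),
)
for event in log:
    print(event)
```

The trade-off is CPU time for memory: the cache is scanned once per relation with dirty buffers, but nothing proportional to the number of dirty buffers has to be allocated.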

The downside of my patch is a longer checkpoint. Checkpoint time
increased by about 10% - 20%. But the checkpoint still completes
correctly on schedule within checkpoint_timeout. Please see the
checkpoint results (http://goo.gl/NsbC6).

For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
more lazy checkpoint.


- Heikki


