Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:

> Then I would test the new background writer with synchronous commits under
> the 8.3 beta, using various settings.  The 0.5, 0.7 and 0.9 settings you
> recommended for a test are how far from the LRU end of the cache to look
> for dirty pages to write, correct?

This is alluding to the suggestions I gave at http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php

checkpoint_completion_target has nothing to do with the LRU, so let's step back to fundamentals and talk about what it actually does. The official documentation is at http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

As you generate transactions, Postgres puts data into the WAL. The WAL is organized into segments that are typically 16MB each. Periodically, the system hits a checkpoint where the WAL data up to a certain point is guaranteed to have been applied to the database, at which point the old WAL files aren't needed anymore and can be reused. These checkpoints are generally caused by one of two things happening:

1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last checkpoint
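
For reference, those two triggers map onto a pair of postgresql.conf settings. The values below are what I recall the stock 8.2 defaults being; yours are obviously different:

    checkpoint_segments = 3      # trigger (1): checkpoint after this many 16MB WAL segments
    checkpoint_timeout = 5min    # trigger (2): checkpoint after this much time regardless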

The system doesn't stop working while the checkpoint is happening; it just keeps creating new WAL files. As long as each checkpoint finishes before the next one is needed, performance should be fine.

In the 8.2 model, processing the checkpoint occurs as fast as data can be written to disk. In 8.3, the writes can be spread out instead. What checkpoint_completion_target does is suggest how far along the system should aim to have finished the current checkpoint relative to when the next one is expected.

For example, your current system has checkpoint_segments=10. Assume that you have checkpoint_timeout set to a large number such that the checkpoints are typically being driven by the number of segments being filled (so you get a checkpoint every 10 WAL segments, period). If checkpoint_completion_target was set to 0.5, the expectation is that the writes for the currently executing checkpoint would be finished about the time that 0.5*10=5 segments of new WAL data had been written. If you set it to 0.9 instead, you'd expect the checkpoint is finishing just about when the 9th WAL segment is being written out, which is cutting things a bit tight; somewhere around there is the safe upper limit for that parameter.
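
To make that arithmetic concrete, here's roughly what the scenario above looks like in postgresql.conf; the timeout shown is just a placeholder big enough to keep the segment count as the thing driving checkpoints:

    checkpoint_segments = 10
    checkpoint_timeout = 30min              # placeholder so segments stay the trigger
    checkpoint_completion_target = 0.5      # aim to finish by ~0.5 * 10 = 5 new segments
    #checkpoint_completion_target = 0.9     # aim for ~9 segments; near the practical upper limit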

Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that on your current system that's forcing very regular checkpoints, which makes each individual checkpoint have less work to do and therefore reduces the impact of the spikes you're trying to avoid. With LDC and checkpoint_completion_target, you can make that number much bigger (I suggested 50), which means you'll only have 1/5 as many checkpoints causing I/O spikes, and each of those checkpoints will have 5X as long to potentially spread the writes over. The main cost is that it will take longer to recover if your database crashes, which hopefully is a rare event.
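
A starting point along those lines for the 8.3 runs might look like the below; treat the numbers as things to experiment with rather than recommendations:

    checkpoint_segments = 50                # ~1/5 as many checkpoints as with 10
    checkpoint_completion_target = 0.5      # then try 0.7 and 0.9 in later runs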

Having far fewer checkpoints is obviously a win for your situation, but the open question is whether spreading them out this way will reduce the I/O spike as effectively as the all-scan background writer in 8.2 has been working for you. This is one aspect that makes your comparison a bit tricky. It's possible that by increasing the segments enough, you'll get into a situation where you don't see (m)any of them during your testing run of 8.3. You should try to collect some data on how regularly checkpoints are happening during early testing to get an idea whether this is a possibility. The usual approach is to set checkpoint_warning to a really high number (like the maximum of 3600); you'll then get a harmless note in the logs every time a checkpoint happens, which shows you how frequently they occur. It's important to know how many checkpoints to expect during each test run in order to put together a fair comparison; as you increase checkpoint_segments, you need to adopt a mindset of asking "how many sluggish transactions am I seeing per checkpoint?", not how many total per test run.
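
That logging trick is a one-line change in postgresql.conf:

    checkpoint_warning = 3600    # note in the logs any checkpoint that starts less than
                                 # an hour after the previous one, i.e. nearly all of them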

I have a backport of some of the pg_stat_bgwriter features added in 8.3 that can be applied to 8.2, which might be helpful for monitoring your benchmarking server (this is most certainly *not* suitable to go onto the real one): http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm  I put that together specifically to allow easier comparisons of 8.2 and 8.3 in this area.
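
On 8.3, or on 8.2 with that backport installed, a query like the following run before and after each test gives you the checkpoint counts plus a breakdown of which process did the buffer writes. The column names here are from the 8.3 view; double-check them against whatever you end up running:

    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend
    FROM pg_stat_bgwriter;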

> Are the current shared memory and the 1 GB you suggested enough of a spread for these tests? (At several hours per test in order to get meaningful results, I don't want to get into too many permutations.)

Having a much larger shared_buffers setting should allow you to keep more data in memory usefully, which may lead to an overall performance gain due to improved efficiency. With your current configuration, I would guess that making the buffer cache bigger would increase the checkpoint spike problems, where that shouldn't be as much of a problem with 8.3 because of how the checkpoint can be spread out. The hope here is that by letting PostgreSQL cache more and avoiding writes of popular buffers except at checkpoint time, your total I/O will be significantly lower with 8.3 compared to how much an aggressive BGW will write in 8.2. Right now, you've got a pretty low number of pages that accumulate a high usage count; that may change if you give the buffer cache a lot more room to work.
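
If you want to watch how those usage counts shift as you grow shared_buffers, the contrib/pg_buffercache module is the easy way to do it. A sketch, assuming that module is installed in your test database:

    -- Distribution of buffer usage counts across the cache
    SELECT usagecount, count(*) AS buffers
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;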

> Finally, I would try the new checkpoint techniques, with and without the
> new background writer.  Any suggestions on where to set the knobs for
> those runs?

This and your related question about simulating the new LRU behavior by "turning off the 'all' scan and setting the lru scan percentage to 50% or more" depend on what final form the LRU background writer ends up in. Certainly you should consider using higher values for the percentage and maxpages parameters with 8.3 in its current form, because the all scan is no longer doing the majority of the work. If some form of my JIT BGW patch gets applied before beta, you'll still want to increase maxpages but won't have to play with the percentage anymore; you might try adjusting the multiplier setting instead.
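
In terms of today's parameter names, that suggestion translates to something like the below; the values are only illustrative, and the knobs themselves may still change before 8.3 goes final:

    bgwriter_lru_maxpages = 500     # allow far more writes per round than the old default of 5
    bgwriter_lru_percent = 10.0     # scan a much bigger slice than the 8.2 default of 1.0
                                    # (this one goes away if the JIT patch is applied;
                                    #  you'd adjust the multiplier setting instead)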

> I'm inclined to think that it would be interesting to try the benchmarks with the backend writing any dirty page through to the OS at the same time they are written to the PostgreSQL cache, as a reference point at the opposite extreme from having the cache hold onto dirty pages for as long as possible before sharing them with the OS. Do you see any value in getting actual numbers for that?

It might be an interesting curiosity to see how this works for you, but I'm not sure of its value to the community at large. The configuration trend for larger systems seems pretty clear at this point: use large values for shared_buffers and checkpoint_segments, and minimize total I/O in the background writer by not writing more than you have to, only even considering buffers that are about to be reused; everything else only gets written out at checkpoint time. The fact that you've gotten good results in the past with a radically different configuration than normal best practice, one that works around problems in 8.2, is an interesting data point. But I don't see any reason to jump from there to expecting that turning the PostgreSQL cache into what's essentially a write-through one, as you describe here, will be helpful in most cases, and I'm not sure how you would do it anyway.

What I would encourage you to take a look at while you're doing these experiments is radically lowering the Linux dirty_background_ratio tunable (perhaps even to 0) to see what that does for you. From what I've seen in the past, the caching there is more likely to be the root of your problem. Hopefully LDC will address your issue such that you don't have to adjust this, because it will lower efficiency considerably, but it may be the most straightforward way to get the more timely I/O path you're obviously looking for.
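
On Linux that's a one-line change, for example in /etc/sysctl.conf (or echoed into /proc/sys/vm/); as with everything else here, this is something to try on the benchmarking box, not a production recommendation:

    vm.dirty_background_ratio = 1    # or 0; the default is typically 10, meaning pdflush
                                     # doesn't start background writes until 10% of RAM is dirty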

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
