Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:

> Then I would test the new background writer with synchronous commits under
> the 8.3 beta, using various settings.  The 0.5, 0.7 and 0.9 settings you
> recommended for a test are how far from the LRU end of the cache to look
> for dirty pages to write, correct?

This is alluding to the suggestions I gave at http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php

checkpoint_completion_target has nothing to do with the LRU, so let's step back to fundamentals and talk about what it actually does. The official documentation is at http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

As you generate transactions, Postgres puts data into the WAL. The WAL is organized into segments that are typically 16MB each. Periodically, the system hits a checkpoint where the WAL data up to a certain point is guaranteed to have been applied to the database, at which point the old WAL files aren't needed anymore and can be reused. These checkpoints are generally caused by one of two things happening:

1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last checkpoint
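
For reference, those two triggers map onto a pair of postgresql.conf settings. The values below are what I recall the stock 8.2 defaults being; yours are obviously different:

    checkpoint_segments = 3      # trigger (1): checkpoint after this many 16MB WAL segments
    checkpoint_timeout = 5min    # trigger (2): checkpoint after this much time regardless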

The system doesn't stop working while the checkpoint is happening; it just keeps creating new WAL files. As long as each checkpoint finishes before the next one is needed, performance should be fine.

In the 8.2 model, processing the checkpoint occurs as fast as data can be written to disk. In 8.3, the writes can be spread out instead. What checkpoint_completion_target does is suggest how far along the system should aim to have finished the current checkpoint relative to when the next one is expected.

For example, your current system has checkpoint_segments=10. Assume that you have checkpoint_timeout set to a large number such that the checkpoints are typically being driven by the number of segments being filled (so you get a checkpoint every 10 WAL segments, period). If checkpoint_completion_target was set to 0.5, the expectation is that the writes for the currently executing checkpoint would be finished about the time that 0.5*10=5 segments of new WAL data had been written. If you set it to 0.9 instead, you'd expect the checkpoint is finishing just about when the 9th WAL segment is being written out, which is cutting things a bit tight; somewhere around there is the safe upper limit for that parameter.
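
To make that arithmetic concrete, here's roughly what the scenario above looks like in postgresql.conf; the timeout shown is just a placeholder big enough to keep the segment count as the thing driving checkpoints:

    checkpoint_segments = 10
    checkpoint_timeout = 30min              # placeholder so segments stay the trigger
    checkpoint_completion_target = 0.5      # aim to finish by ~0.5 * 10 = 5 new segments
    #checkpoint_completion_target = 0.9     # aim for ~9 segments; near the practical upper limit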

Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that on your current system that's forcing very regular checkpoints, which makes each individual checkpoint have less work to do and therefore reduces the impact of the spikes you're trying to avoid. With LDC and checkpoint_completion_target, you can make that number much bigger (I suggested 50), which means you'll only have 1/5 as many checkpoints causing I/O spikes, and each of those checkpoints will have 5X as long to potentially spread the writes over. The main cost is that it will take longer to recover if your database crashes, which hopefully is a rare event.
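
A starting point along those lines for the 8.3 runs might look like the below; treat the numbers as things to experiment with rather than recommendations:

    checkpoint_segments = 50                # ~1/5 as many checkpoints as with 10
    checkpoint_completion_target = 0.5      # then try 0.7 and 0.9 in later runs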

Having far fewer checkpoints is obviously a win for your situation, but the open question is whether spreading them out this way will reduce the I/O spike as effectively as the all-scan background writer in 8.2 has been working for you. This is one aspect that makes your comparison a bit tricky. It's possible that by increasing the segments enough, you'll get into a situation where you don't see (m)any of them during your testing run of 8.3. You should try to collect some data on how regularly checkpoints are happening during early testing to get an idea whether this is a possibility. The usual approach is to set checkpoint_warning to a really high number (like the maximum of 3600); you'll then get a harmless note in the logs every time a checkpoint happens, which shows you how frequently they occur. It's important to know how many checkpoints to expect during each test run in order to put together a fair comparison; as you increase checkpoint_segments, you need to adopt a mindset of asking "how many sluggish transactions am I seeing per checkpoint?", not how many total per test run.
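
That logging trick is a one-line change in postgresql.conf:

    checkpoint_warning = 3600    # note in the logs any checkpoint that starts less than
                                 # an hour after the previous one, i.e. nearly all of them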

I have a backport of some of the pg_stat_bgwriter features added in 8.3 that can be applied to 8.2, which might be helpful for monitoring your benchmarking server (this is most certainly *not* suitable to go onto the real one): http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm  I put that together specifically to allow easier comparisons of 8.2 and 8.3 in this area.
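
On 8.3, or on 8.2 with that backport installed, a query like the following run before and after each test gives you the checkpoint counts plus a breakdown of which process did the buffer writes. The column names here are from the 8.3 view; double-check them against whatever you end up running:

    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend
    FROM pg_stat_bgwriter;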

> Are the current shared memory and the 1 GB you suggested enough of a spread for these tests? (At several hours per test in order to get meaningful results, I don't want to get into too many permutations.)

Having a much larger shared_buffers setting should allow you to keep more data in memory usefully, which may lead to an overall performance gain due to improved efficiency. With your current configuration, I would guess that making the buffer cache bigger would increase the checkpoint spike problems, where that shouldn't be as much of a problem with 8.3 because of how the checkpoint can be spread out. The hope here is that by letting PostgreSQL cache more and avoiding writes of popular buffers except at checkpoint time, your total I/O will be significantly lower with 8.3 compared to how much an aggressive BGW will write in 8.2. Right now, you've got a pretty low number of pages that accumulate a high usage count; that may change if you give the buffer cache a lot more room to work.
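
If you want to watch how those usage counts shift as you grow shared_buffers, the contrib/pg_buffercache module is the easy way to do it. A sketch, assuming that module is installed in your test database:

    -- Distribution of buffer usage counts across the cache
    SELECT usagecount, count(*) AS buffers
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;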

> Finally, I would try the new checkpoint techniques, with and without the
> new background writer.  Any suggestions on where to set the knobs for
> those runs?

This and your related question about simulating the new LRU behavior by "turning off the 'all' scan and setting the lru scan percentage to 50% or more" depend on what final form the LRU background writer ends up in. Certainly you should consider using higher values for the percentage and maxpages parameters with 8.3 in its current form, because the all scan is no longer doing the majority of the work. If some form of my JIT BGW patch gets applied before beta, you'll still want to increase maxpages but won't have to play with the percentage anymore; you might try adjusting the multiplier setting instead.
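
In terms of today's parameter names, that suggestion translates to something like the below; the values are only illustrative, and the knobs themselves may still change before 8.3 goes final:

    bgwriter_lru_maxpages = 500     # allow far more writes per round than the old default of 5
    bgwriter_lru_percent = 10.0     # scan a much bigger slice than the 8.2 default of 1.0
                                    # (this one goes away if the JIT patch is applied;
                                    #  you'd adjust the multiplier setting instead)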

> I'm inclined to think that it would be interesting to try the benchmarks with the backend writing any dirty page through to the OS at the same time they are written to the PostgreSQL cache, as a reference point at the opposite extreme from having the cache hold onto dirty pages for as long as possible before sharing them with the OS. Do you see any value in getting actual numbers for that?

It might be an interesting curiosity to see how this works for you, but I'm not sure of its value to the community at large. The configuration trend for larger systems seems pretty clear at this point: use large values for shared_buffers and checkpoint_segments, and minimize total I/O in the background writer by not writing more than you have to, only even considering buffers that are about to be reused; everything else only gets written out at checkpoint time. The fact that you've gotten good results in the past with a radically different configuration than normal best practice, one that works around problems in 8.2, is an interesting data point. But I don't see any reason to jump from there to expecting that turning the PostgreSQL cache into what's essentially a write-through one, as you describe here, will be helpful in most cases, and I'm not sure how you would do it anyway.

What I would encourage you to take a look at while you're doing these experiments is radically lowering the Linux dirty_background_ratio tunable (perhaps even to 0) to see what that does for you. From what I've seen in the past, the caching there is more likely to be the root of your problem. Hopefully LDC will address your issue such that you don't have to adjust this, because it will lower efficiency considerably, but it may be the most straightforward way to get the more timely I/O path you're obviously looking for.
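
On Linux that's a one-line change, for example in /etc/sysctl.conf (or echoed into /proc/sys/vm/); as with everything else here, this is something to try on the benchmarking box, not a production recommendation:

    vm.dirty_background_ratio = 1    # or 0; the default is typically 10, meaning pdflush
                                     # doesn't start background writes until 10% of RAM is dirty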

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
