Greg Smith wrote:
> On Thu, 5 Jul 2007, Heikki Linnakangas wrote:

>> It looks like Tom's idea is not a winner; it leads to more writes than necessary.

> What I came away with as the core of Tom's idea is that the cleaning/LRU writer shouldn't ever scan the same section of the buffer cache twice, because anything that resulted in a new dirty buffer will be unwritable by it until the clock sweep passes over it. I never took that to mean that idea necessarily had to be implemented as "trying to aggressively keep all pages with usage_count=0 clean".

> I've been making slow progress on this myself, and the question I've been trying to answer is whether this fundamental idea really matters or not. One clear benefit of that alternate implementation is that it should allow setting a lower value for the interval without being as concerned that you're wasting resources by doing so, which I've found to be a problem with the current implementation--it will consume a lot of CPU scanning the same section right now if you lower that too much.

Yes. In fact, ignoring the CPU overhead of scanning the same section over and over again, Tom's proposal is equivalent to setting both bgwriter_lru_* settings all the way up to the maximum. I ran a DBT-2 test like that as well, and the number of writes was indeed the same, just with higher CPU usage. It's clear that scanning the same section over and over again has been a waste of time in previous releases.

As a further data point, I constructed a smaller test case that performs random DELETEs on a table using an index. I varied shared_buffers, and ran the test with the bgwriter either disabled or tuned all the way up to the maximum. Here are the results:

 shared_buffers | writes_off | writes_max |   writes_ratio
----------------+------------+------------+-------------------
           2560 |      86936 |      88023 |  1.01250345081439
           5120 |      81207 |      84551 |  1.04117871612053
           7680 |      75367 |      80603 |  1.06947337694216
          10240 |      69772 |      74533 |  1.06823654187926
          12800 |      64281 |      69237 |  1.07709898725907
          15360 |      58515 |      64735 |  1.10629753054772
          17920 |      53231 |      58635 |  1.10151979109917
          20480 |      48128 |      54403 |  1.13038148271277
          23040 |      43087 |      49949 |  1.15925917330053
          25600 |      39062 |      46477 |   1.1898264297783
          28160 |      35391 |      43739 |  1.23587917832217
          30720 |      32713 |      37480 |  1.14572188426619
          33280 |      31634 |      31677 |  1.00135929695897
          35840 |      31668 |      31717 |  1.00154730327144
          38400 |      31696 |      31693 | 0.999905350832913
          40960 |      31685 |      31730 |  1.00142023039293
          43520 |      31694 |      31650 | 0.998611724616647
          46080 |      31661 |      31650 | 0.999652569407157

The writes_off column is the number of writes with the bgwriter disabled, and writes_max is the number with the aggressive bgwriter. The table is 33334 pages, so once shared_buffers is around that size the whole table fits in cache and the bgwriter strategy makes no difference.


> As far as your results, first off I'm really glad to see someone else comparing checkpoint/backend/bgwriter writes the same way I've been doing, so I finally have someone else's results to compare against. I expect that the optimal approach here is a hybrid one that structures scanning the buffer cache the new way Tom suggests, but limits the number of writes to "just enough". I happen to be fond of the "just enough" computation based on a weighted moving average I wrote before, but there's certainly room for multiple implementations of that part of the code to evolve.

We need to get the requirements straight.

One goal of the bgwriter is clearly to keep just enough buffers clean in front of the clock hand that backends don't need to do writes themselves until the next bgwriter iteration, but not any more than that; otherwise we might end up doing more writes than necessary if some of the buffers are redirtied.
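
Roughly like this, just to illustrate the idea (a simplified standalone sketch with a made-up buffer struct and function names, not the real BufferDesc or the actual bgwriter code):

#include <stdbool.h>

/* Made-up, simplified buffer descriptor, for illustration only. */
typedef struct
{
    int  usage_count;
    bool dirty;
} SketchBuffer;

/* Stand-in for flushing a dirty page to disk. */
static void
write_buffer(SketchBuffer *buf)
{
    buf->dirty = false;
}

/*
 * Scan ahead of the clock hand and clean the dirty buffers that the next
 * sweep could hand out to backends, stopping as soon as 'target' reusable
 * (usage_count == 0 and clean) buffers have been seen.  Buffers with a
 * non-zero usage count are skipped; they can't be handed out until the
 * sweep has decremented them, so writing them now would risk a wasted
 * write if they're redirtied in the meantime.
 */
static void
clean_ahead(SketchBuffer *buffers, int nbuffers, int clock_hand, int target)
{
    int reusable = 0;

    for (int scanned = 0; scanned < nbuffers && reusable < target; scanned++)
    {
        SketchBuffer *buf = &buffers[(clock_hand + scanned) % nbuffers];

        if (buf->usage_count > 0)
            continue;

        if (buf->dirty)
            write_buffer(buf);

        reusable++;
    }
}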

To deal with bursty workloads, for example a batch of 2 GB worth of inserts coming in every 10 minutes, it seems we want to keep doing a little bit of cleaning even when the system is idle, to prepare for the next burst. The idea is to smooth out the physical I/O bursts: if we don't clean the dirty buffers left over from the previous burst during the idle period, the I/O system will be bottlenecked during the bursts and sit idle otherwise.

To strike a balance between cleaning buffers ahead of possible future bursts and not doing unnecessary I/O when no such bursts come, I think a reasonable strategy is to write buffers with usage_count=0 at a slow pace when there are no buffer allocations happening.
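
Something along these lines, say (hypothetical names, just to make the idea concrete, not actual bgwriter code):

/*
 * Hypothetical per-round decision: how many usage_count == 0 buffers to
 * try to clean in this bgwriter round.
 */
static int
cleaning_budget(int allocs_since_last_round, int smoothed_target, int idle_trickle)
{
    /* Backends are allocating buffers: keep up with the estimated demand. */
    if (allocs_since_last_round > 0)
        return smoothed_target;

    /*
     * System is idle: still trickle out a small fixed number of dirty
     * usage_count == 0 buffers per round, so the leftovers from the
     * previous burst are flushed before the next burst arrives.
     */
    return idle_trickle;
}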

To smooth out the small variations in a relatively steady workload, the weighted average sounds good.
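
For instance, one possible form of it is an exponentially weighted moving average; in this standalone toy the 0.16 smoothing factor, the 1.1 slack multiplier and the allocation numbers are made up, not taken from Greg's code:

#include <stdio.h>

int
main(void)
{
    /* Buffer allocations seen in each bgwriter round (made-up numbers). */
    int    recent_allocs[] = {120, 130, 125, 0, 0, 2000, 150, 140};
    double smoothed_alloc = 0.0;

    for (int i = 0; i < 8; i++)
    {
        /* Exponentially weighted moving average of the allocation rate. */
        smoothed_alloc += ((double) recent_allocs[i] - smoothed_alloc) * 0.16;

        /* Aim to clean a bit more than the estimate to leave some slack. */
        int target = (int) (smoothed_alloc * 1.1) + 1;

        printf("round %d: allocs = %4d, smoothed = %7.2f, target = %d\n",
               i, recent_allocs[i], smoothed_alloc, target);
    }
    return 0;
}

A single spike like the 2000 above only moves the estimate up partially, and the estimate then decays back over the following rounds, which is the kind of smoothing of small variations we're after.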




--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

