In the interest of closing work on what's officially titled the "Automatic adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I think this is at, what I'm working on right now, and see whether feedback changes how I submit my final attempt at a useful patch in this area this week. Hopefully there are enough free eyes to stare at this now to wrap up a plan that makes sense and still fits the 8.3 schedule. I'd hate to see this pushed off to 8.4 without making some forward progress after the amount of work already done, particularly since the odds aren't good that I'll still be working with this code by then.

Let me start with a summary of the conclusions I've reached based on my own tests and the set that Heikki did last month (last results at http://community.enterprisedb.com/bgwriter/ ); Heikki will hopefully chime in if he disagrees with how I'm characterizing things:

1) In the current configuration, if you have a large setting for bgwriter_lru_percent and/or a small setting for bgwriter_delay, that can be extremely wasteful because the background writer will consume CPU/locking resources scanning the buffer pool needlessly. This problem should go away.

2) Having backends write their own buffers out does not significantly degrade performance, as those turn into cached OS writes which generally execute fast enough to not be a large drag on the backend.

3) Any attempt to scan significantly ahead of the current strategy point will result in some amount of premature writes, which decreases overall efficiency in cases where the buffer is touched again before it gets re-used. The further ahead you go, the worse this inefficiency gets. For many workloads the most efficient approach is to just let the backends do all the writes.

4) Tom observed that there's no reason to ever scan the same section of the pool more than once, because anything that changes a buffer's status will always make it un-reusable until the strategy point has passed over it. But because of (3), this does not mean that one should drive forward constantly trying to lap the buffer pool and catch up with the strategy point.

5) There hasn't been any definitive proof that the background writer is helpful at all in the context of 8.3. However, yanking it out altogether may be premature, as there are some theorized ways it may be helpful in real-world situations with more intermittent workloads than are generally encountered in a benchmarking situation. I personally feel there is some potential for the BGW to become more useful in the context of the 8.4 release if it starts doing things like adding pages it expects to be recycled soon onto the free list, which could improve backend efficiency quite a bit compared to the current situation where each backend normally runs its own scan. But that's a bit too big to fit into 8.3, I think.

What I'm aiming for here is to have the BGW do as little work as possible, as efficiently as possible, but not remove it altogether. (2) suggests this approach won't decrease performance compared to the current 8.2 situation, where I've seen evidence that some people are over-tuning a very aggressive BGW to scan an enormous amount of the pool each time because they have resources to burn. Keeping a generally self-tuning background writer that errs on the lazy side in the codebase satisfies (5). Here is what the patch I'm testing right now does to try and balance all this out:

A) Counters are added to pg_stat_bgwriter that show how many buffers were written by the backends, how many by the background writer, how many times bgwriter_lru_maxpages was hit, and the total number of buffers allocated. This at least allows monitoring what's going on as people run their own experiments. Heikki's results included data using the earlier version of this patch I assembled (which now conflicts with HEAD; I have an updated one).
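To give a concrete picture of the bookkeeping involved, here's a minimal standalone C sketch of those counters and the sort of ratio an admin might watch; the struct, field names, and sample numbers are all invented for this example rather than copied from the patch:

/* Illustrative sketch only: counters of this sort get accumulated locally
 * and reported to the stats collector once per cycle.  Names and numbers
 * here are placeholders for the example, not the patch's identifiers. */
#include <stdio.h>

typedef struct BgWriterCounters
{
    long    buffers_backend;    /* writes done by ordinary backends */
    long    buffers_clean;      /* writes done by the LRU cleaner */
    long    maxwritten_clean;   /* cycles stopped by bgwriter_lru_maxpages */
    long    buffers_alloc;      /* total buffer allocations */
} BgWriterCounters;

int
main(void)
{
    /* Placeholder values, just to show the kind of derived number an
     * admin could compute while experimenting. */
    BgWriterCounters c = {120, 880, 3, 1000};

    printf("fraction of writes done by backends: %.1f%%\n",
           100.0 * c.buffers_backend / (c.buffers_backend + c.buffers_clean));
    return 0;
}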

B) bgwriter_lru_percent is removed as a tunable. This eliminates (1). The idea of scanning a fixed percentage doesn't ever make sense given the observations above; we scan until we accomplish the cleaning mission instead.

C) bgwriter_lru_maxpages stays as an absolute maximum number of pages that can be written in one sweep each bgwriter_delay. This allows easily turning the writer off altogether by setting it to 0, or limiting how active it tries to be in situations where (3) is a concern. Admins can monitor how often the maximum is hit in pg_stat_bgwriter and consider raising it (or lowering the delay) if it proves too limiting. I think the default needs to be bumped to something more like 100 rather than the current tiny value before the stock configuration can be considered "self-tuning" at all.

D) The strategy code gets a "passes" count added to it that serves as a sort of high-order int for how many times the buffer cache has been looked over in its entirety.

E) When the background writer starts the LRU cleaner, it checks whether the strategy point has passed where it last cleaned up to, using the passes+buf_id "pointer". If so, it just starts cleaning from the strategy point as it always has. But if it's still ahead, it continues from where it left off, thus implementing the core of (4)'s insight. It estimates how many buffers are probably clean in the space between the strategy point and where it's starting, based on how far ahead it is combined with historical data about how many buffers are scanned on average per reusable buffer found (the exact computation of this number is the main thing I'm still fiddling with).
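Since (D) and (E) are the least obvious pieces, here's a rough standalone sketch of what that passes+buf_id pointer and the gap estimate might look like. Every name is invented for this example, and the estimate shown (gap divided by the historical scans-per-reusable-buffer ratio) is just one plausible computation, not necessarily what the final patch will do:

/* Rough sketch, not the patch: a scan position built from a complete-passes
 * count plus a buffer index, and an estimate of how many reusable buffers
 * probably sit in the gap between the strategy point and where the cleaner
 * stopped last time. */
#include <stdbool.h>
#include <stdio.h>

typedef struct ScanPosition
{
    long    passes;     /* full laps of the buffer pool completed */
    int     buf_id;     /* slot within the pool, 0 .. NBuffers-1 */
} ScanPosition;

/* True if position a is still ahead of position b in scan order. */
static bool
ahead_of(ScanPosition a, ScanPosition b)
{
    return (a.passes > b.passes) ||
           (a.passes == b.passes && a.buf_id > b.buf_id);
}

/* Number of buffers between b (behind) and a (ahead). */
static long
distance(ScanPosition a, ScanPosition b, int NBuffers)
{
    return (a.passes - b.passes) * (long) NBuffers + (a.buf_id - b.buf_id);
}

int
main(void)
{
    int         NBuffers = 1024;            /* pool size, illustrative */
    ScanPosition strategy = {5, 100};       /* where backends are reusing */
    ScanPosition cleaner = {5, 400};        /* where the cleaner stopped */
    double      scans_per_reusable = 2.5;   /* historical average */

    if (ahead_of(cleaner, strategy))
    {
        long    gap = distance(cleaner, strategy, NBuffers);

        /* Credit for buffers already scanned in the gap that are probably
         * still clean and reusable. */
        long    est_clean = (long) (gap / scans_per_reusable);

        printf("gap=%ld buffers, estimated clean among them=%ld\n",
               gap, est_clean);
    }
    else
        printf("strategy point has lapped the cleaner; restart from it\n");
    return 0;
}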

F) A moving average of buffer allocations is used to predict how many clean buffers are expected to be needed in the next delay cycle. The original patch from Itagaki doubled the recent allocations to pad this out; (3) suggests that's too much.
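As a concrete illustration, here's one way such a smoothed prediction could look. The exponential smoothing factor and the decision not to pad the estimate are assumptions made for this sketch, not necessarily what the submitted patch does:

/* Rough sketch only: keep a moving average of per-cycle buffer allocations
 * and use it as the estimate of how many clean buffers the next delay
 * cycle will need. */
#include <stdio.h>

static double smoothed_alloc = 0.0;     /* running average of allocations */

/* Called once per bgwriter_delay cycle with the allocations seen since the
 * previous call; returns the predicted need for the next cycle. */
static int
predict_allocations(int recent_alloc)
{
    const double smoothing = 0.16;      /* weight given to the newest sample */

    smoothed_alloc += smoothing * (recent_alloc - smoothed_alloc);
    return (int) (smoothed_alloc + 0.5);
}

int
main(void)
{
    int     samples[] = {200, 180, 260, 240, 50, 220};
    int     i;

    for (i = 0; i < 6; i++)
        printf("cycle %d: recent=%d predicted=%d\n",
               i, samples[i], predict_allocations(samples[i]));
    return 0;
}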

G) Scan the buffer pool until one of the following happens (a rough sketch of this loop follows the list):
  --Enough reusable buffers have been located or written out to fill the upcoming allocation need, taking into account the estimate from (E); this is the normal, expected way for the scan to terminate.
  --We've written bgwriter_lru_maxpages buffers.
  --We "lap" the pool and catch the strategy point.

In addition to removing a tunable and making the remaining two less critical, one of my hopes here is that the more efficient way this scheme operates will allow using much smaller values for bgwriter_delay than have been practical in the current codebase, which may ultimately have its own value.

That's what I've got working here now; it still needs some more tweaking and testing before I'm done with the code, but there's not much left. The main problem I foresee is that this approach is moderately complicated, adding a lot of new code and regular+static variables, for something that's not really proven to be valuable. I will not be surprised if my patch is rejected on that basis. That's why I wanted to get the big picture painted in this message while I finish up the work necessary to submit it, because if the whole idea is doomed anyway I might as well stop now.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
