In the interest of closing out work on what's officially titled the
"Automatic adjustment of bgwriter_lru_maxpages" patch, I wanted to
summarize where I think this stands, what I'm working on right now, and see
whether feedback changes how I submit my final attempt at a useful patch in
this area this week. Hopefully there are enough free eyes available to
stare at this now to wrap up a plan that makes sense and still fits in the
8.3 schedule. I'd hate to see this pushed off to 8.4 without making some
forward progress after the amount of work already done, particularly when
the odds aren't good that I'll still be working with this code by then.
Let me start with a summary of the conclusions I've reached based on my
own tests and the set that Heikki did last month (last results at
http://community.enterprisedb.com/bgwriter/ ); Heikki will hopefully chime
in if he disagrees with how I'm characterizing things:
1) In the current configuration, if you have a large setting for
bgwriter_lru_percent and/or a small setting for bgwriter_delay, that can
be extremely wasteful because the background writer will consume
CPU/locking resources scanning the buffer pool needlessly. This problem
should go away.
2) Having backends write their own buffers out does not significantly
degrade performance, as those turn into cached OS writes which generally
execute fast enough to not be a large drag on the backend.
3) Any attempt to scan significantly ahead of the current strategy point
will result in some premature writes, which decrease overall efficiency in
cases where the buffer is touched again before it gets reused. The further
ahead you scan, the worse this inefficiency gets. For many workloads, the
most efficient approach is simply to let the backends do all the writes.
4) Tom observed that there's no reason to ever scan the same section of
the pool more than once, because anything that changes a buffer's status
will always make it un-reusable until the strategy point has passed over
it. But because of (3), this does not mean that one should drive forward
constantly trying to lap the buffer pool and catch up with the strategy
point.
5) There hasn't been any definitive proof that the background writer is
helpful at all in the context of 8.3. However, yanking it out altogether
may be premature, as there are some theorized ways it may be helpful in
real-world situations with more intermittent workloads than are generally
encountered in a benchmarking situation. I personally feel there is some
potential for the BGW to become more useful in the context of the 8.4
release if it starts doing things like adding pages it expects to be
recycled soon onto the free list, which could improve backend efficiency
quite a bit compared to the current situation where each backend normally
runs its own scan. But that's a bit too big to fit into 8.3 I think.
What I'm aiming for here is to have the BGW do as little work as possible,
as efficiently as possible, but not remove it altogether. (2) suggests
that this approach won't decrease performance compared to the current 8.2
situation, where I've seen evidence that some people are over-tuning to
have a very aggressive BGW scan an enormous amount of the pool each time
because they have resources to burn. Keeping a generally self-tuning
background writer that errs on the lazy side in the codebase satisfies (5).
Here is what the patch I'm testing right now does to try and balance all
this out:
A) Counters are added to pg_stat_bgwriter that show how many buffers were
written by the backends, by the background writer, how many times
bgwriter_lru_maxpages was hit, and the total number of buffers allocated.
This at least allows monitoring what's going on as people run their own
experiments. Heikki's results included data using the earlier version of
this patch I assembled (which now conflicts with HEAD; I have an updated
one).
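To make the monitoring concrete, here is a rough sketch of the sort of
counters involved; the field names below are just illustrative and may not
match exactly what the patch exposes through pg_stat_bgwriter:

    /* Illustrative counters; the actual names may differ. */
    typedef struct LruCleanerStats
    {
        long    buffers_backend;    /* buffers written by ordinary backends */
        long    buffers_clean;      /* buffers written by the LRU cleaning scan */
        long    maxwritten_clean;   /* scans stopped by bgwriter_lru_maxpages */
        long    buffers_alloc;      /* total buffer allocations */
    } LruCleanerStats;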
B) bgwriter_lru_percent is removed as a tunable. This eliminates (1).
The idea of scanning a fixed percentage doesn't ever make sense given the
observations above; we scan until we accomplish the cleaning mission
instead.
C) bgwriter_lru_maxpages stays as an absolute maximum number of pages that
can be written in one sweep each bgwriter_delay. This allows easily
turning the writer off altogether by setting it to 0, or limiting how
active it tries to be in situations where (3) is a concern. Admins can
monitor the amount that the max is hit in pg_stat_bgwriter and consider
raising it (or lowering the delay) if it proves too limiting. I think the
default needs to be bumped to something more like 100 rather than the
current tiny value before the stock configuration can be considered
"self-tuning" at all.
D) The strategy code gets a "passes" count added to it that serves as a
sort of high-order int for how many times the buffer cache has been looked
over in its entirety.
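As a rough illustration of what "high-order int" means here, assuming a
structure along these lines (the actual code may well differ):

    #include <stdint.h>

    extern int NBuffers;    /* PostgreSQL global: number of shared buffers */

    /* Combined "pointer": the passes count is the high-order part and the
     * buffer id the low-order part, so positions from different laps over
     * the pool compare correctly. */
    typedef struct
    {
        uint32_t    passes;     /* complete laps over the whole buffer cache */
        int         buf_id;     /* current slot within the pool */
    } ScanPosition;

    static int64_t
    scan_position(ScanPosition p)
    {
        return (int64_t) p.passes * NBuffers + p.buf_id;
    }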
E) When the background writer starts the LRU cleaner, it checks whether
the strategy point has passed the spot it last cleaned up to, using the
passes+buf_id "pointer". If so, it just starts cleaning from the strategy
point as it always has. But if it's still ahead, it continues from there,
thus implementing the core of (4)'s insight. It estimates how many
buffers are probably clean in the space between the strategy point and
where it's starting, based on how far ahead it is combined with historical
data about how many buffers are scanned on average per reusable buffer
found (the exact computation of this number is the main thing I'm still
fiddling with).
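Continuing the sketch above, the ahead-or-behind check and the clean-buffer
estimate might look something like the following; scans_per_reusable stands
in for the historical average I'm still tuning, and all of the names are
assumptions rather than the patch's actual code:

    /* Average number of buffers scanned per reusable buffer found,
     * maintained from previous cycles. */
    static double scans_per_reusable = 1.0;

    /* If the LRU cleaner's saved position is still ahead of the strategy
     * point, estimate how many buffers in the gap are probably reusable;
     * otherwise report zero and restart the scan at the strategy point. */
    static int
    estimate_clean_ahead(ScanPosition cleaner, ScanPosition strategy)
    {
        int64_t buffers_ahead = scan_position(cleaner) - scan_position(strategy);

        if (buffers_ahead <= 0)
            return 0;           /* strategy point has passed us; start over */

        /* Scanning N buffers for each reusable one found implies roughly
         * buffers_ahead / N of the gap should still be reusable. */
        return (int) (buffers_ahead / scans_per_reusable);
    }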
F) A moving average of buffer allocations is used to predict how many
clean buffers are expected to be needed in the next delay cycle. The
original patch from Itagaki doubled the recent allocations to pad this
out; (3) suggests that's too much.
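A minimal sketch of such a moving average, assuming simple exponential
smoothing (the smoothing constant here is made up, not taken from either
patch):

    static double smoothed_alloc = 0.0;

    /* Fold the allocations seen in the cycle that just finished into the
     * running estimate and return the prediction for the next
     * bgwriter_delay interval. */
    static int
    predict_alloc_need(int recent_allocs)
    {
        const double alpha = 0.2;   /* weight given to the newest sample */

        smoothed_alloc += alpha * ((double) recent_allocs - smoothed_alloc);

        /* Itagaki's original patch doubled recent_allocs as padding;
         * per (3), no such multiplier is applied here. */
        return (int) (smoothed_alloc + 0.5);
    }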
G) Scan the buffer pool until one of the following happens (a rough sketch
of this loop appears after the list):
--Enough reusable buffers have been located or written out to fill the
upcoming allocation need, taking into account the estimate from (E); this
is the normal expected way the scan will terminate.
--We've written bgwriter_lru_maxpages
--We "lap" and catch the strategy point
In addition to removing a tunable and making the remaining two less
critical, one of my hopes here is that the more efficient way this scheme
operates will allow using much smaller values for bgwriter_delay than have
been practical in the current codebase, which may ultimately have its own
value.
That's what I've got working here now, still need some more tweaking and
testing before I'm done with the code but there's not much left. The main
problem I foresee is that this approach is moderately complicated, adding a
lot of new code and regular+static variables, for something that's not
really proven to be valuable. I will not be surprised if my patch is
rejected on that basis. That's why I wanted to get the big picture
painted in this message while I finish up the work necessary to submit it,
'cause if the whole idea is doomed anyway I might as well stop now.
--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD