I just had an epiphany, I think. As I wrote in the LDC discussion,
http://archives.postgresql.org/pgsql-patches/2007-06/msg00294.php

if the bgwriter's LRU-cleaning scan has advanced ahead of freelist.c's clock sweep pointer, then any buffers between them are either clean, or are pinned and/or have usage_count > 0 (in which case the bgwriter wouldn't bother to clean them, and freelist.c wouldn't consider them candidates for re-use). And *this invariant is not destroyed by the activities of other backends*. A backend cannot dirty a page without raising its usage_count from zero, and there are no race cases, because the pages in transition will be pinned.
This means that there is absolutely no point in having the bgwriter re-start its LRU scan from the clock sweep position each time, as it currently does. Any pages it revisits are not going to need cleaning. We might as well have it progress forward from where it stopped before.

In fact, the notion of the bgwriter's cleaning scan being "in front of" the clock sweep is entirely backward. It should try to be behind the sweep, ie, so far ahead that it's lapped the clock sweep and is trailing along right behind it, cleaning buffers immediately after their usage_count falls to zero. All the rest of the buffer arena is either clean or has positive usage_count.

This means that we don't need the bgwriter_lru_percent parameter at all; all we need is the lru_maxpages limit on how much I/O to initiate per wakeup. On each wakeup, the bgwriter always cleans until either it's dumped lru_maxpages buffers, or it's caught up with the clock sweep.

There is a risk that if the clock sweep manages to lap the bgwriter, the bgwriter would stop upon "catching up", when in reality there are dirty pages everywhere. This is easily prevented, though, if we add to the shared BufferStrategyControl struct a counter that is incremented each time the clock sweep wraps around to buffer zero. (Essentially this counter stores the high-order bits of the sweep counter.) The bgwriter can then recognize having been lapped by comparing that counter to its own similar counter. If it does get lapped, it should advance its work pointer to the current sweep pointer and try to get ahead again. (There's no point in continuing to clean pages behind the sweep when those just ahead of it are dirty.)

This idea changes the terms of discussion for Itagaki-san's automatic-adjustment-of-lru_maxpages patch. I'm not sure we'd still need it at all, as lru_maxpages would now be just an upper bound on the desired I/O rate, rather than the target itself.
If we do still need such a patch, it probably needs to look a lot different than it does now.

Comments?

			regards, tom lane