On Jan 3, 2012, at 7:34 PM, Robert Haas wrote:
> On Tue, Jan 3, 2012 at 7:11 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Jim Nasby <j...@nasby.net> writes:
>>> Yeah, but the problem we run into is that with every backend trying to run 
>>> the clock on its own we end up with high contention again... it's just in 
>>> a different place than when we had a true LRU. The clock sweep might be 
>>> cheaper than the linked list was, but it's still awfully expensive. I 
>>> believe our best bet is to have a free list that is actually useful in 
>>> normal operations, and then optimize the cost of pulling buffers out of 
>>> that list as much as possible (and let the bgwriter deal with keeping 
>>> enough pages in that list to satisfy demand).
>> 
>> Well, maybe, but I think the historical evidence suggests that that
>> approach will be a loser, simply because no matter how cheap, the
>> freelist will remain a centralized and heavily contended data structure.
>> IMO we need to be looking for a mechanism that has no single point of
>> contention, and modifying the clock sweep rules looks like the best way
>> to get there.
>> 
>> Still, he who wants to do the work can try whatever approach he feels
>> like.
> 
> It might be possible to partition the free list.  So imagine, for
> example, 8 free lists.  The background writer runs the clock sweep and
> finds some buffers that are about ready to be reallocated and
> distributes one-eighth of them to each free list.  Then, when a
> backend wants to allocate a buffer, it picks a free list somehow
> (round robin?) and pulls a buffer off it.  If the buffer turns out to
> have been used since it was added to the free list, we give up on it
> and try again.  This hopefully shouldn't happen too often, though, as
> long as we only add enough buffers to the free list to satisfy the
> requests that we expect to get over the next
> small-fraction-of-a-second.
> 
> Of course you have to think about what happens if you find that your
> chosen free list is empty.  In that case you probably want to try a
> different one, and if in the worst case where they're all empty, run
> the clock sweep in the foreground.  You probably also want to kick the
> background writer in the pants if even a single one is empty (or
> nearly empty?) and tell it to hurry up and add some more buffers to
> the freelist.
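
To make sure I'm reading the proposal right, here's a rough sketch of what that
allocation path could look like. Everything below is made up for illustration:
none of these names exist in bufmgr.c, and pthread mutexes stand in for whatever
locking we'd actually use.

/*
 * Hypothetical sketch of the partitioned-freelist idea. Not existing code.
 */
#include <pthread.h>
#include <stdbool.h>

#define NUM_FREELISTS 8
#define FREELIST_SIZE 64

typedef struct FreeListPartition
{
    pthread_mutex_t lock;
    int             buf_ids[FREELIST_SIZE]; /* candidates queued by the bgwriter */
    int             nbuffers;               /* how many are currently queued */
} FreeListPartition;

static FreeListPartition freelists[NUM_FREELISTS];

/* Stand-ins for buffer-manager internals, purely for illustration. */
static bool BufferStillUnused(int buf_id) { (void) buf_id; return true; }
static int  ClockSweepOne(void)           { return 0; }
static void KickBgwriter(void)            { }

/*
 * Pop a candidate buffer: start with this backend's partition, fall back to
 * the others, and only run the clock sweep ourselves if every list is empty.
 */
static int
GetFreeBuffer(int my_partition)
{
    int i;

    for (i = 0; i < NUM_FREELISTS; i++)
    {
        FreeListPartition *fl = &freelists[(my_partition + i) % NUM_FREELISTS];

        pthread_mutex_lock(&fl->lock);
        while (fl->nbuffers > 0)
        {
            int buf_id = fl->buf_ids[--fl->nbuffers];

            pthread_mutex_unlock(&fl->lock);

            /* It may have been touched since the bgwriter queued it. */
            if (BufferStillUnused(buf_id))
                return buf_id;

            pthread_mutex_lock(&fl->lock);
        }

        /* This partition is empty: ask the bgwriter to refill it. */
        KickBgwriter();
        pthread_mutex_unlock(&fl->lock);
    }

    /* All partitions were empty; do the sweep in the foreground. */
    return ClockSweepOne();
}

The nice property is that the only work done under any one partition's lock is
popping an id; all of the expensive scanning stays in the bgwriter.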

If it comes down to it, we can look at partitioning the free list. But here's 
the thing: this is the strategy that FreeBSD (and I think now Linux as well) 
uses to service memory requests, whether for free memory or for reading data 
from disk. If it's good enough for an OS, I would expect we can make it work 
here as well.

I would expect pulling a page off the free list to be an extremely fast 
operation: take the read lock on the list (we should probably have separate 
locks for putting things on the list vs. taking them off), pin the buffer (which 
shouldn't be contentious), update the read pointer, and drop the read lock. Even 
in C that's probably fewer than 100 machine instructions, and no looping. It 
should be significantly faster than running a clock sweep. So while there will 
be contention on the free-list read lock, none of it should last very long.
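
Roughly what I'm picturing is below. The struct, the field names, and the
pthread/C11 primitives standing in for LWLocks are all invented for
illustration; this is a sketch, not anything that exists in the tree today.

/*
 * Sketch of the pop path described above.
 */
#include <pthread.h>
#include <stdatomic.h>

#define FREELIST_RING_SIZE 1024

typedef struct BufferFreeList
{
    pthread_mutex_t read_lock;   /* taken by backends popping entries */
    pthread_mutex_t write_lock;  /* taken by the bgwriter adding entries */
    int             ring[FREELIST_RING_SIZE];
    unsigned        read_pos;    /* next slot to consume, owned by readers */
    atomic_uint     write_pos;   /* next slot to fill, advanced by the writer */
} BufferFreeList;

/* Stand-in for the real PinBuffer(); bumping the refcount on a buffer nobody
 * else wants shouldn't be contentious. */
static void PinBuffer(int buf_id) { (void) buf_id; }

/*
 * Pop one buffer id, or return -1 if the list is empty and the caller has to
 * fall back to a clock sweep.  The critical section is a couple of loads, a
 * pin, and a store: no looping, no scanning.
 */
static int
PopFreeBuffer(BufferFreeList *fl)
{
    int buf_id = -1;

    pthread_mutex_lock(&fl->read_lock);
    if (fl->read_pos != atomic_load(&fl->write_pos))
    {
        buf_id = fl->ring[fl->read_pos % FREELIST_RING_SIZE];
        PinBuffer(buf_id);      /* pin before we give up our claim on the slot */
        fl->read_pos++;         /* advance the read pointer */
    }
    pthread_mutex_unlock(&fl->read_lock);

    return buf_id;
}

The point is that all the expensive work (deciding which buffers to queue)
happens in the bgwriter under the write lock, so backends only ever fight over
that tiny read-side critical section.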

Do we have any other places where we take extremely short locks like this and 
still run into problems?
--
Jim C. Nasby, Database Architect                   j...@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


