On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
> I'd like to just back up a minute here and talk about the broader
> picture here.
Sounds like a very good plan.
> So in other words,
> there's no huge *performance* problem for a working set larger than
> shared_buffers, but there is a huge *scalability* problem. Now why is
> that?
> As far as I can tell, the answer is that we've got a scalability
> problem around BufFreelistLock.
Part of the problem is it's name ;)
> Contention on the buffer mapping
> locks may also be a problem, but all of my previous benchmarking (with
> LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
> in the room.
Contention wise I aggree. What I have seen is that we have a huge
amount of cacheline bouncing around the buffer header spinlocks.
> My interest in having the background writer add buffers
> to the free list is basically around solving that problem. It's a
> pretty dramatic problem, as the graph above shows, and this patch
> doesn't solve it.
> One thing that occurred to me while writing this note is that the
> background writer doesn't have any compelling reason to run on a
> read-only workload. It will still run at a certain minimum rate, so
> that it cycles the buffer pool every 2 minutes, if I remember
> correctly.
I have previously added some adhoc instrumentation that printed the
amount of buffers that were required (by other backends) during a
bgwriter cycle and the amount of buffers that the buffer manager could
actually write out. I don't think I actually found any workload where
the bgwriter actually wroute out a relevant percentage of the necessary
pages.
Which would explain why the patch doesn't have a big benefit. The
freelist is empty most of the time, so we don't benefit from the reduced
work done under the lock.
I think the whole algorithm that guides how much the background writer
actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we are actually able to measure the
benefit from this patch. I personally don't think there's much to
salvage from the current code.
Problems with the current code:
* doesn't manipulate the usage_count and never does anything to used
pages. Which means it will just about never find a victim buffer in a
busy database.
* by far not aggressive enough, touches only a few buffers ahead of the
clock sweep.
* does not advance the clock sweep, so the individual backends will
touch the same buffers again and transfer all the buffer spinlock
cacheline around
* The adaption logic it has makes it so slow to adapt that it takes
several minutes to adapt.
* ...
There's another thing we could do to noticeably improve scalability of
buffer acquiration. Currently we do a huge amount of work under the
freelist lock.
In StrategyGetBuffer:
LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
...
// check freelist, will usually be empty
...
for (;;)
{
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
++StrategyControl->nextVictimBuffer;
LockBufHdr(buf);
if (buf->refcount == 0)
{
if (buf->usage_count > 0)
{
buf->usage_count--;
}
else
{
/* Found a usable buffer */
if (strategy != NULL)
AddBufferToRing(strategy, buf);
return buf;
}
}
UnlockBufHdr(buf);
}
So, we perform the entire clock sweep until we found a single buffer we
can use inside a *global* lock. At times we need to iterate over the
whole shared buffers BM_MAX_USAGE_COUNT (5) times till we pushed down all
the usage counts enough (if the database is busy it can take even
longer...).
In a busy database where usually all the usagecounts are high the next
backend will touch a lot of those buffers again which causes massive
cache eviction & bouncing.
It seems far more sensible to only protect the clock sweep's
nextVictimBuffer with a spinlock. With some care all the rest can happen
without any global interlock.
I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter moving pages onto the freelist makes
sense because then backends then don't need to deal with dirty pages.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers