Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

Jim Nasby Mon, 14 Apr 2014 16:04:30 -0700

On 4/14/14, 12:11 PM, Peter Geoghegan wrote:

I have some theories about the PostgreSQL buffer manager/clock sweep.
To motivate the reader to get through the material presented here, I
present up-front a benchmark of a proof-of-concept patch of mine:


http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/

Test Set 4 represents the patch's performance here.

This shows some considerable improvements for a tpc-b workload, with
15 minute runs, where the buffer manager struggles with moderately
intense cache pressure. shared_buffers is 8GiB, with 32GiB of system
memory in total. The scale factor is 5,000 here, so that puts the
primary index of the accounts table at a size that makes it impossible
to cache entirely within shared_buffers, by a margin of a couple of
GiBs. pgbench_accounts_pkey is ~10 GB, and pgbench_accounts is ~63
GB. Obviously the heap is much larger, since for that table heap
tuples are several times the size of index tuples (the ratio here is
probably well below the mean, if I can be permitted to make a vast
generalization).

PostgreSQL implements a clock sweep algorithm, which gets us something
approaching an LRU for the buffer manager in trade-off for less
contention on core structures. Buffers have a usage_count/"popularity"
that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
algorithm only has one bit for what approximates our "usage_count" (so
it's either 0 or 1). I think that at its core CLOCK is an algorithm
that has some very desirable properties that I am sure must be
preserved. Actually, I think it's more accurate to say we use a
variant of clock pro, a refinement of the original CLOCK.
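
For readers who have not looked at the sweep itself, here is a minimal
sketch of the idea described above: a usage count that saturates at 5 and
a clock hand that decrements counts until it finds a buffer at zero. This
is a toy model, not the PostgreSQL code. ToyBuffer, ToyPool, toy_touch and
toy_sweep are hypothetical names, and pins, locking and the free list are
left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_USAGE_COUNT 5   /* same saturation point as BM_MAX_USAGE_COUNT in the text */

/* Hypothetical, heavily simplified buffer descriptor: no pins, no locks. */
typedef struct {
    int usage_count;
} ToyBuffer;

typedef struct {
    ToyBuffer *buffers;
    size_t     nbuffers;
    size_t     hand;        /* next buffer the clock hand will inspect */
} ToyPool;

/* A buffer hit bumps the count, saturating at MAX_USAGE_COUNT. */
static void toy_touch(ToyPool *pool, size_t idx)
{
    assert(idx < pool->nbuffers);
    if (pool->buffers[idx].usage_count < MAX_USAGE_COUNT)
        pool->buffers[idx].usage_count++;
}

/*
 * Advance the hand until a buffer with usage_count == 0 is found.
 * Every nonzero buffer passed over loses one point of "popularity".
 * Returns the victim's index; *steps (if non-NULL) receives the number
 * of buffers inspected. Setting the victim's count after reuse is left
 * to the caller.
 */
static size_t toy_sweep(ToyPool *pool, size_t *steps)
{
    size_t inspected = 0;

    assert(pool->nbuffers > 0);
    for (;;) {
        size_t idx = pool->hand;

        pool->hand = (pool->hand + 1) % pool->nbuffers;
        inspected++;

        if (pool->buffers[idx].usage_count == 0) {
            if (steps != NULL)
                *steps = inspected;
            return idx;
        }
        pool->buffers[idx].usage_count--;
    }
}
```

With the classic one-bit CLOCK, MAX_USAGE_COUNT would be 1; the only
difference in this sketch is how many passes of the hand a popular buffer
can survive.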


I think it's important to mention that OS implementations (at least all I know of) 
have multiple page pools, each of which has its own clock. IIRC one of the 
arguments for us supporting a count>1 was we could get the benefits of multiple 
page pools without the overhead. In reality I believe that argument is false, 
because the clocks for each page pool in an OS *run at different rates* based on 
system demands.

I don't know if multiple buffer pools would be good or bad for Postgres, but I 
do think it's important to remember this difference any time we look at what 
OSes do.

If you look at the test sets that this patch covers (with all the
tricks applied), there are pretty good figures throughout. You can
kind of see the pain towards the end, but there are no dramatic falls
in responsiveness for minutes at a time. There are latency spikes, but
they're *far* shorter, and much better hidden. Without looking at
individual multiple minute spikes, at the macro level (all client
counts for all runs) average latency is about half of what is seen on
master.


My guess would be that those latency spikes are caused by a need to run the 
clock for an extended period. IIRC there's code floating around that makes it 
possible to measure that.
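
To put a rough number on "an extended period", here is a small sketch
that counts how many buffers a sweep has to inspect before it finds a
victim. It is only an illustration under simplifying assumptions (no
pins, no concurrent backends, non-negative counts), and
count_sweep_steps is a hypothetical helper, not the measurement code
mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: simulate one victim search over an array of
 * usage counts, starting at position `hand`, and return the number of
 * buffers inspected. The array is modified the way a sweep would modify
 * it (each nonzero count passed over is decremented). Counts are
 * assumed to be non-negative.
 */
static size_t count_sweep_steps(int *usage, size_t n, size_t hand)
{
    size_t steps = 0;

    assert(n > 0);
    assert(hand < n);
    for (;;) {
        steps++;
        if (usage[hand] == 0)
            return steps;       /* found a victim */
        usage[hand]--;
        hand = (hand + 1) % n;
    }
}
```

In this model, if all N buffers sit at the maximum count of 5, the hand
makes five full passes before the first buffer reaches zero, so a single
victim search inspects 5*N + 1 buffers. That is the worst case for the
toy model only; whether it explains the spikes in the benchmark is
exactly the open question.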

I suspect it would be very interesting to see what happens if your patch is 
combined with the work that (Greg?) did to reduce the odds of individual 
backends needing to run the clock. (I know part of that work looked at 
proactively keeping pages on the free list, but I think there was more to it 
than that).
--
Jim C. Nasby, Data Architect                       j...@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
