Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-21 Thread Robert Haas
On Mon, Apr 20, 2015 at 2:53 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 I think that would help, but it still leaves user backends trying to advance
 the clock, which is quite painful. Has anyone tested running the clock in
 the background? We need a wiki page with all the ideas that have been tested
 around buffer management...

Amit's bgreclaimer patch did that, but we weren't able to demonstrate
a clear benefit.  I haven't given up on the idea yet, but we've got to
be able to prove that it's a good idea in practice as well as in
theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Peter Geoghegan
On Tue, Apr 14, 2015 at 7:02 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan p...@heroku.com wrote:
 Why is that good?

 We did discuss this before.  I've recapped some of what I believe to
 be the most salient points below.

 I think that people were all too quick to dismiss the idea of a wall
 time interval playing some role here (at least as a defense against
 correlated references, as a correlated reference period). I suppose
 that that's because it doesn't fit with an intuition that says that
 that kind of interval ought to be derived algebraically - magic delay
 settings are considered suspect.

 Yep, Tom gave that reason here:

 http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us

 But there was also this point from Andres - gettimeofday is not free:

 http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de

 And this point from me - this can degrade to random eviction under
 high pressure:

 http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com

 You'll notice that my proposal avoids all three of those objections.

All I'm saying is that you shouldn't dismiss the idea without trying
it out properly. The LRU-K paper, from the early 1990s, recommends
this, and gettimeofday() calls were a lot more expensive back then.
I'm sure that there is a way to overcome these issues if it turns out
to be worth it (by amortizing gettimeofday() calls by driving it from
an auxiliary process like the bgwriter, for example). In fact, I'm
almost certain there is, because at least one other major database
system uses just such a reference period.
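
[Editor's note: one way that amortization could look — an illustrative sketch under assumed names, not an existing patch. A bgwriter-like auxiliary process refreshes a coarse timestamp in shared memory a few times per second, and backends compare against that instead of calling gettimeofday() on every access; that is plenty of resolution for a correlated-reference period measured in hundreds of milliseconds.]

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* In a real system this would live in shared memory. */
static _Atomic uint64_t coarse_now_ms;

/* Run by a bgwriter-like auxiliary process every ~100 ms. */
static void
coarse_clock_tick(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    atomic_store(&coarse_now_ms,
                 (uint64_t) tv.tv_sec * 1000 + tv.tv_usec / 1000);
}

/* Cheap enough for backends to call on every buffer access. */
static int
within_correlated_period(uint64_t last_access_ms, uint64_t period_ms)
{
    return atomic_load(&coarse_now_ms) - last_access_ms < period_ms;
}

int
main(void)
{
    coarse_clock_tick();
    uint64_t first_touch = atomic_load(&coarse_now_ms);

    /* A second touch within the period is treated as correlated and
     * would not earn the buffer another usage-count bump. */
    printf("correlated: %d\n", within_correlated_period(first_touch, 300));
    return 0;
}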

Back when ARC (or was it 2Q?) was committed before being reverted
shortly thereafter, there was a similar idea actually implemented, but
with XIDs preventing correlated references (which the LRU-K paper also
hints at). I think that an actual delay is more robust than that,
though. Even though this correlated reference delay is all I
implemented with my prototype, it's just one piece of the puzzle.

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Merlin Moncure
On Mon, Apr 20, 2015 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout
 klep...@svana.org wrote:
 I've been following this thread from the side with interest and got
 twigged by the point about loss of information.  If you'd like better
 information about relative ages, you can achieve this by raising the
 cap on the usage count and dividing (or right-shifting) each sweep.

 Yeah, I thought about that, too.  It might be worth experimenting with.

 This would allow you to remember much more about the relative
 worth of often used pages.  With a cap of 32 you'd have the same effect
 as now where after 5 sweeps the buffer is evicted.  Mathematically the
 count would converge to the number of times the block is used per
 sweep.

 Hmm, interesting point.  It's possible that we'd still have problems
 with everything maxing out at 32 on some workloads, but at least it'd
 be a little harder to max out at 32 than at 5.

Do we have any reproducible test cases to evaluate these assumptions?
I haven't looked at this stuff for a while, but my main issue with
the clock sweep was finding sweep-heavy cases that did not also have
trouble with the buffer mapping locks (although the facts on the
ground may have changed since then).

merlin




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Robert Haas
On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout
klep...@svana.org wrote:
 I've been following this thread from the side with interest and got
 twigged by the point about loss of information.  If you'd like better
 information about relative ages, you can achieve this by raising the
 cap on the usage count and dividing (or right-shifting) each sweep.

Yeah, I thought about that, too.  It might be worth experimenting with.

 This would allow you to remember much more about the relative
 worth of often used pages.  With a cap of 32 you'd have the same effect
 as now where after 5 sweeps the buffer is evicted.  Mathematically the
 count would converge to the number of times the block is used per
 sweep.

Hmm, interesting point.  It's possible that we'd still have problems
with everything maxing out at 32 on some workloads, but at least it'd
be a little harder to max out at 32 than at 5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Robert Haas
On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote:
 This is my point though (you're right that flushed isn't always the
 same as eviction but that's not the important point here). Right now
 we only demote when we consider buffers for eviction. But we promote
 when we pin buffers. Those two things aren't necessarily happening at
 the same rate and in fact are often orders of magnitude different.

I am absolutely, positively, violently in 100% agreement with this.  I
have made the same point before, but it sure is nice to hear someone
else thinking about it the same way.

 So
 it makes sense that we end up with a lot of buffers promoted all the
 way to the most recently used ntile and then have to do n passes
 around the clock and have no good information about which buffer to
 evict.

Right.

 What I'm saying is that we should demote a buffer every time we
 promote a buffer. So every time we pin a buffer we should advance the
 clock a corresponding amount. (I know I'm being intentionally vague
 about what the corresponding amount is.) The important thing is that
 the two should be tied together.

Yes, absolutely.  If you tilt your head the right way, my proposal of
limiting the number of promotions per clock sweep has the effect of
tying buffer demotion and buffer promotion together much more tightly
than is the case right now.  You are limited to 2 promotions per
demotion; and practically speaking not all buffers eligible to be
promoted will actually get accessed, so the number of promotions per
demotion will in reality be somewhere between 0 and 2.  Ideally it
would be exactly 1, but 1 +/- 1 is still a tighter limit than we have
at present.  Which is not to say there isn't some other idea that is
better still.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Robert Haas
On Mon, Apr 20, 2015 at 11:00 AM, Merlin Moncure mmonc...@gmail.com wrote:
 Hmm, interesting point.  It's possible that we'd still have problems
 with everything maxing out at 32 on some workloads, but at least it'd
 be a little harder to max out at 32 than at 5.

 Do we have any reproducible test cases to evaluate these assumptions?

The particular point that you are responding to here is easy to
reproduce.  Just create any workload that fits in shared_buffers - a
small pgbench database, for example - and access stuff.  No usage
counts will go down, but every access will drive usage counts up.
Eventually everything will hit any maximum you care to install, and
actually I don't think it takes very long.  You can use pg_buffercache
to see the results.

  I haven't looked at this stuff for a while, but my main issue with
 the clock sweep was finding sweep-heavy cases that did not also have
 trouble with the buffer mapping locks (although the facts on the
 ground may have changed since then).

We increased the number of buffer mapping locks to 128, so that
problem should be considerably ameliorated now.  But it's not totally
gone, as demonstrated by Andres's experiments with my chash patch.

There was also a patch that eliminated BufFreelistLock in favor of a
spinlock held for much shorter time periods, and then Andres took that
one step further and made it use atomics.  That used to be a
*terrible* bottleneck on eviction-heavy workloads and is now much
improved.  More work may remain to be done, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-20 Thread Jim Nasby

On 4/20/15 11:11 AM, Robert Haas wrote:

On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote:

This is my point though (you're right that flushed isn't always the
same as eviction but that's not the important point here). Right now
we only demote when we consider buffers for eviction. But we promote
when we pin buffers. Those two things aren't necessarily happening at
the same rate and in fact are often orders of magnitude different.


I am absolutely, positively, violently in 100% agreement with this.  I
have made the same point before, but it sure is nice to hear someone
else thinking about it the same way.


+1


What I'm saying is that we should demote a buffer every time we
promote a buffer. So every time we pin a buffer we should advance the
clock a corresponding amount. (I know I'm being intentionally vague
about what the corresponding amount is.) The important thing is that
the two should be tied together.


Yes, absolutely.  If you tilt your head the right way, my proposal of
limiting the number of promotions per clock sweep has the effect of
tying buffer demotion and buffer promotion together much more tightly
than is the case right now.  You are limited to 2 promotions per
demotion; and practically speaking not all buffers eligible to be
promoted will actually get accessed, so the number of promotions per
demotion will in reality be somewhere between 0 and 2.  Ideally it
would be exactly 1, but 1 +/- 1 is still a tighter limit than we have
at present.  Which is not to say there isn't some other idea that is
better still.


I think that would help, but it still leaves user backends trying to 
advance the clock, which is quite painful. Has anyone tested running the 
clock in the background? We need a wiki page with all the ideas that 
have been tested around buffer management...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-15 Thread Amit Kapila
On Wed, Apr 15, 2015 at 10:07 AM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  IIUC, this will allow us to increase the usage count only after the
  buffer has been touched by the clock sweep to decrement its usage count.

 Yes.

  I think such a solution will be good for cases where many evictions
  need to be performed to satisfy the workload.  OTOH, when not too many
  evictions need to be done, some of the buffers that are accessed much
  more will have an equal probability of getting evicted compared to
  buffers which are accessed less.

 Possibly, but I think it's even worse under the current algorithm.
 Under this proposal, if we go for a long time without any buffer
 evictions, every buffer's usage count will top out at 2 more than
 wherever it was after the last clock sweep.  In the worst case, every
 buffer (or most of them) could end up with the same usage count.  But
 under the status quo, they'll all go to 5, which is an even bigger
 loss of information, and which will make the first eviction much more
 expensive than if they are all pegged at 2 or 3.


Okay, got your point.  On further thinking about this idea, it occurred
to me that you are trying to reduce the clock-sweep time, which in turn
will improve the chances of finding a usable buffer quickly under high
pressure.  If that is true, then I think we should have seen a similar
gain with the bgreclaimer idea which I tried some time back, because
that also reduced the time spent in the clock sweep.

 There could be ways to improve things further, such as by slowly
 advancing the clock sweep even when no eviction is required.  But
 that's a bit tricky to do, too.  You'd like (perhaps) to advance it
 one step for every buffer allocation, but you don't want a centralized
 counter, or unnecessary contention on the clock sweep when no eviction
 is necessary in the first place.  There's probably some way to make it
 work, though, if we put our mind to it.  I'm inclined to try the
 simpler approach first and see what it buys.


Sure, I think it is worth a try.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-15 Thread Martijn van Oosterhout
On Wed, Apr 15, 2015 at 12:37:44AM -0400, Robert Haas wrote:
  I think such a solution will be good for cases where many evictions
  need to be performed to satisfy the workload.  OTOH, when not too many
  evictions need to be done, some of the buffers that are accessed much
  more will have an equal probability of getting evicted compared to
  buffers which are accessed less.
 
 Possibly, but I think it's even worse under the current algorithm.
 Under this proposal, if we go for a long time without any buffer
 evictions, every buffer's usage count will top out at 2 more than
 wherever it was after the last clock sweep.  In the worst case, every
 buffer (or most of them) could end up with the same usage count.  But
 under the status quo, they'll all go to 5, which is an even bigger
 loss of information, and which will make the first eviction much more
 expensive than if they are all pegged at 2 or 3.

I've been following this thread from the side with interest and got
twigged by the point about loss of information.  If you'd like better
information about relative ages, you can achieve this by raising the
cap on the usage count and dividing (or right-shifting) each sweep.

This would allow you to remember much more about the relative
worth of often used pages.  With a cap of 32 you'd have the same effect
as now where after 5 sweeps the buffer is evicted.  Mathematically the
count would converge to the number of times the block is used per
sweep.

If you wanted to be really clever, you could at the beginning of each
sweep take an estimate of the number of buffers used since the last
sweep (from the stats collector perhaps) and use that to drive your
divisor, so if you have a lots of allocations you become more
aggressive about reducing the counts.  Or if the load is light fall
back to just subtracting one.  Then you don't need a cap at all.
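
[Editor's note: a minimal sketch of that decay rule, with invented names and an invented "heavy pressure" flag standing in for the allocation estimate mentioned above — not PostgreSQL code.]

#include <stdint.h>
#include <stdio.h>

#define USAGE_CAP 32            /* raised from today's cap of 5 */

/* Bump on access, saturating at the higher cap. */
static uint8_t
on_access(uint8_t count)
{
    return count < USAGE_CAP ? (uint8_t) (count + 1) : count;
}

/*
 * Decay on each sweep pass: a right shift halves the count, so an
 * untouched buffer that peaked at 32 dies after roughly five sweeps --
 * about the lifetime of today's 5-count scheme -- while keeping many
 * more gradations of warmth in between.  Under light load, fall back
 * to subtracting one.
 */
static uint8_t
on_sweep(uint8_t count, int heavy_pressure)
{
    if (heavy_pressure)
        return count >> 1;
    return count > 0 ? (uint8_t) (count - 1) : 0;
}

int
main(void)
{
    uint8_t hot = 32;           /* heavily used in the past, now idle */
    uint8_t warm = 8;           /* touched about once per sweep */

    for (int sweep = 1; sweep <= 5; sweep++)
    {
        hot = on_sweep(hot, 1);
        warm = on_access(on_sweep(warm, 1));
        printf("sweep %d: hot=%2u warm=%u\n", sweep, hot, warm);
    }
    /* warm settles near its per-sweep access rate; hot decays away */
    return 0;
}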

(Apologies if this has been suggested before, Google didn't find
anything for me).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-15 Thread Greg Stark
On Wed, Apr 15, 2015 at 5:26 AM, Robert Haas robertmh...@gmail.com wrote:
 The way our cache works we promote when a buffer is accessed but we only
 demote when a buffer is flushed. We flush a lot less often than we touch
 buffers so it's not surprising that the cache ends up full of buffers that
 are all in the most recently used section.

 This isn't really correct.  We promote when it's accessed, but we
 demote it when the clock sweep hand passes over it, which happens each
 time we consider it for eviction.  It does not have to do with
 flushing dirty data to disk, and it does not happen only when the
 buffer is actually evicted.


This is my point though (you're right that flushed isn't always the
same as eviction but that's not the important point here). Right now
we only demote when we consider buffers for eviction. But we promote
when we pin buffers. Those two things aren't necessarily happening at
the same rate and in fact are often orders of magnitude different. So
it makes sense that we end up with a lot of buffers promoted all the
way to the most recently used ntile and then have to do n passes
around the clock and have no good information about which buffer to
evict.

What I'm saying is that we should demote a buffer every time we
promote a buffer. So every time we pin a buffer we should advance the
clock a corresponding amount. (I know I'm being intentionally vague
about what the corresponding amount is.) The important thing is that
the two should be tied together.
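
[Editor's note: a sketch of the coupling, with invented names; PINS_PER_TICK is the deliberately vague "corresponding amount" above, and nothing here is actual PostgreSQL code.]

#include <stdint.h>
#include <stdio.h>

#define NBUFFERS      16
#define MAX_USAGE     5
#define PINS_PER_TICK 4   /* the deliberately vague "corresponding amount" */

typedef struct { uint8_t usage_count; } Buf;

static Buf      pool[NBUFFERS];
static int      hand;
static unsigned pin_counter;

/* Demotion step: the same decrement the sweep performs during eviction. */
static void
demote_one(void)
{
    if (pool[hand].usage_count > 0)
        pool[hand].usage_count--;
    hand = (hand + 1) % NBUFFERS;
}

/* Promotion and demotion are coupled: every Nth pin advances the clock. */
static void
on_pin(Buf *buf)
{
    if (buf->usage_count < MAX_USAGE)
        buf->usage_count++;             /* promote */
    if (++pin_counter % PINS_PER_TICK == 0)
        demote_one();                   /* pay for it by demoting elsewhere */
}

int
main(void)
{
    for (int i = 0; i < 100; i++)
        on_pin(&pool[i % 3]);           /* three hot buffers take all pins */
    printf("hot=%u cold=%u\n", pool[0].usage_count, pool[8].usage_count);
    return 0;
}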

-- 
greg




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Robert Haas
On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan p...@heroku.com wrote:
 Why is that good?

We did discuss this before.  I've recapped some of what I believe to
be the most salient points below.

 I think that people were all too quick to dismiss the idea of a wall
 time interval playing some role here (at least as a defense against
 correlated references, as a correlated reference period). I suppose
 that that's because it doesn't fit with an intuition that says that
 that kind of interval ought to be derived algebraically - magic delay
 settings are considered suspect.

Yep, Tom gave that reason here:

http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us

But there was also this point from Andres - gettimeofday is not free:

http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de

And this point from me - this can degrade to random eviction under
high pressure:

http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com

You'll notice that my proposal avoids all three of those objections.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Amit Kapila
On Wed, Apr 15, 2015 at 2:55 AM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Merlin Moncure mmonc...@gmail.com writes:
  Anyways, I'm still curious if you can post similar numbers basing the
  throttling on gross allocation counts instead of time.  Meaning: some
  number of buffer allocations has to have occurred before you consider
  eviction.  Besides being faster I think it's a better implementation:
  an intermittently loaded server will give more consistent behavior.
 
  Yeah --- I think wall-clock-based throttling is fundamentally the wrong
  thing anyway.  Are we going to start needing a CPU speed measurement to
  tune the algorithm with?  Not the place to be going.  But driving it off
  the number of allocations that've been done could be sensible.  (OTOH,
  that means you need a central counter, which itself would be a
  bottleneck.)

 So, I was thinking about this a little bit more today, prodded by my
 coworker John Gorman.  I'm wondering if we could drive this off of the
 clock sweep; that is, every time the clock sweep touches a buffer, its
 usage count goes down by one, but we also set two flag bits.  Every
 time the buffer gets touched thereafter, we check whether any flag
 bits are set; if so, we clear one and increase the usage count, else
 we do nothing.  So the usage count can increase at most twice per
 clock sweep.  The advantage of that is that, as with Peter's approach,
 it is harder for the usage count of a buffer to max out - to get
 there, you need sustained access over a longer period of time.  But
 it's not time-dependent, so replaying the same workload at twice the
 speed or half the speed on faster or slower hardware doesn't change
 the choice of which buffer to evict, which seems good.  And it will
 cause us to prefer buffers which are accessed repeatedly over a period
 of time rather than buffers that are accessed a bunch of times in
 quick succession and then not touched again for a while, which seems
 like a good bet.


IIUC, this will allow us to increase the usage count only after the
buffer has been touched by the clock sweep to decrement its usage count.
I think such a solution will be good for cases where many evictions
need to be performed to satisfy the workload.  OTOH, when not too many
evictions need to be done, some of the buffers that are accessed much
more will have an equal probability of getting evicted compared to
buffers which are accessed less.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Robert Haas
On Tue, Apr 14, 2015 at 7:10 PM, Greg Stark st...@mit.edu wrote:
 The way the clock sweep algorithm is meant to be thought about is that it's
 an approximate lru. Each usage count corresponds to an ntile of the lru. So
 we don't know which buffer is least recently used but it must be in the set
 of buffers with usage count 0 and that should be 1/nth of all the buffers.

Agreed.

 In order for that property to be maintained though the usage count for some
 buffer should be getting decremented every time we touch a buffer. That is,
 every time we promote one buffer to the most recently used ntile we should
 be demoting some other buffer.

Agreed.  It's not easy to get this behavior exactly, though, because
the buffer you kick out necessarily has a usage count of 0 and the one
you bring in probably shouldn't.  And we don't wanna have to run the
clock sweep every time somebody touches a non-maximal usage count.
But I think it is still true that this is, to some degree, what our
algorithm is trying to approximate, and I also think it's pretty clear
that our current approximation isn't that great.

 The way our cache works we promote when a buffer is accessed but we only
 demote when a buffer is flushed. We flush a lot less often than we touch
 buffers so it's not surprising that the cache ends up full of buffers that
 are all in the most recently used section.

This isn't really correct.  We promote when it's accessed, but we
demote it when the clock sweep hand passes over it, which happens each
time we consider it for eviction.  It does not have to do with
flushing dirty data to disk, and it does not happen only when the
buffer is actually evicted.

 Now it's complicated by the fact that we aren't promoting buffers directly
 to the most recently used ntile. We're incrementing the usage count by one.
 That makes it more of a least frequently used list rather than an LRU. I
 think that's a mistake but I recall some debate about that when it first
 went in.

Note that the discussions of 2Q, LRU(k), and perhaps others ask not
only how recently the page was used, but how frequently it was used.
The recency of the next-to-last access is often used as a proxy for
frequency.  Consider two buffers. One gets touched 5 times in a row,
once a day.  The other gets touched 5 times per day, at equal
intervals.  In general, the second buffer is a better choice to retain
than the first, even if it has been touched less recently.
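
[Editor's note: a toy LRU-2 victim choice to make that example concrete — hypothetical names and a logical clock instead of wall time; not code from any of the papers discussed.]

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint64_t last;        /* most recent access, in logical clock ticks */
    uint64_t penultimate; /* the access before that */
} BufHistory;

static uint64_t clock_ticks;

static void
on_access(BufHistory *h)
{
    h->penultimate = h->last;
    h->last = ++clock_ticks;
}

/* LRU-2: evict the buffer whose second-to-last access is oldest. */
static int
choose_victim(const BufHistory *h, int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++)
        if (h[i].penultimate < h[victim].penultimate)
            victim = i;
    return victim;
}

int
main(void)
{
    BufHistory buf[2] = {{0, 0}, {0, 0}};

    for (int i = 0; i < 5; i++)
        on_access(&buf[0]);             /* buffer 0: one quick burst */

    for (int i = 0; i < 5; i++)
    {
        clock_ticks += 100;             /* time passes between touches */
        on_access(&buf[1]);             /* buffer 1: spread through the day */
    }

    on_access(&buf[0]);                 /* buffer 0 touched once more, so its
                                         * *last* access is now the newest */

    /* Plain LRU would evict buffer 1; LRU-2 evicts buffer 0, whose
     * second-to-last access is ancient (back in the burst). */
    printf("evict buffer %d\n", choose_victim(buf, 2));
    return 0;
}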

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Robert Haas
On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 IIUC, this will allow us to increase usage count only when the buffer
 is touched by clocksweep to decrement the usage count.

Yes.

 I think such a solution will be good for cases where many evictions
 need to be performed to satisfy the workload.  OTOH, when not too many
 evictions need to be done, some of the buffers that are accessed much
 more will have an equal probability of getting evicted compared to
 buffers which are accessed less.

Possibly, but I think it's even worse under the current algorithm.
Under this proposal, if we go for a long time without any buffer
evictions, every buffer's usage count will top out at 2 more than
wherever it was after the last clock sweep.  In the worst case, every
buffer (or most of them) could end up with the same usage count.  But
under the status quo, they'll all go to 5, which is an even bigger
loss of information, and which will make the first eviction much more
expensive than if they are all pegged at 2 or 3.

There could be ways to improve things further, such as by slowly
advancing the clock sweep even when no eviction is required.  But
that's a bit tricky to do, too.  You'd like (perhaps) to advance it
one step for every buffer allocation, but you don't want a centralized
counter, or unnecessary contention on the clock sweep when no eviction
is necessary in the first place.  There's probably some way to make it
work, though, if we put our mind to it.  I'm inclined to try the
simpler approach first and see what it buys.
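
[Editor's note: one decentralized shape the "advance without eviction" idea could take — purely illustrative, with invented names; nothing like this exists in the tree. Each backend batches allocations in a private counter and only periodically advances the shared hand, avoiding a hot centralized counter.]

#include <stdint.h>
#include <stdio.h>

#define NBUFFERS 1024
#define BATCH    16     /* amortize: one sweep burst per 16 allocations */

typedef struct { uint8_t usage_count; } Buf;

static Buf pool[NBUFFERS];
static int shared_hand;                 /* shared; touched only once per batch */
static _Thread_local int local_allocs;  /* per-backend, contention-free */

static void
advance_sweep(int steps)
{
    for (int i = 0; i < steps; i++)
    {
        if (pool[shared_hand].usage_count > 0)
            pool[shared_hand].usage_count--;
        shared_hand = (shared_hand + 1) % NBUFFERS;
    }
}

/* Called on every buffer allocation, even when nothing is evicted. */
static void
on_allocation(void)
{
    if (++local_allocs >= BATCH)
    {
        local_allocs = 0;
        advance_sweep(BATCH);           /* update shared state once per batch */
    }
}

int
main(void)
{
    pool[0].usage_count = 5;
    for (int i = 0; i < 64; i++)        /* 64 allocations, no evictions */
        on_allocation();
    printf("usage_count = %u\n", pool[0].usage_count);  /* decayed to 4 */
    return 0;
}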

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Jim Nasby

On 4/14/15 5:22 PM, Peter Geoghegan wrote:

As long as we're doing random brainstorming, I'd suggest looking at
making clocksweep actually approximate LRU-K/LRU-2 (which, again, to
be clear, my prototype did not do). The clocksweep could maintain
statistics about the recency of the second-to-last access across all
buffers, and discriminate against buffers according to what bucket of
the population they fit in to. Not sure how aggressively we'd penalize
those buffers that had very old penultimate references (or credit
those that had very recent penultimate references), or what the bucket
partitioning scheme is, but that's probably where I'd take it next.
For example, buffers with a penultimate reference that is more than a
standard deviation below the mean would be double penalized (and maybe
the opposite, for those buffers with penultimate accesses a stddev
above the mean). If that didn't work so well, then I'd look into an
ARC style recency and frequency list (while remembering things about
already evicted blocks, which LRU-K does not do, although that paper
is from the early 1990s).


Along the lines of brainstorming... why do we even allow usage_count > 
1? Clocksweep was used pretty successfully by at least FreeBSD, but they 
simply used a bit to indicate recently used. Anything that wasn't 
recently used moved from the active pool to the inactive pool (which 
tended to be far larger than the active pool with decent amounts of 
memory), and a small number of buffers were kept on the 'free' list by 
pulling them out of the inactive pool and writing them if they were 
dirty. All of this was done on an LRU basis.


Given how common it is for the vast bulk of shared_buffers in an install 
to be stuck at 5, I'd think the first thing we should try is a 
combination of greatly reducing the max for usage_count (maybe to 2 
instead of 1 to simulate 2 pools), and running the clock sweep a lot 
more aggressively in a background process.
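
[Editor's note: a cartoon of that two-pool reading of a capped count — hypothetical mapping, loosely modeled on the FreeBSD description above, not on any PostgreSQL patch. usage_count 2 plays "active", 1 "inactive", 0 "evictable", with the sweep run in the background.]

#include <stdint.h>
#include <stdio.h>

#define NBUFFERS 16
#define ACTIVE   2      /* usage_count 2 = active pool            */
                        /* usage_count 1 = inactive pool          */
                        /* usage_count 0 = candidate for freeing  */

typedef struct { uint8_t usage_count; } Buf;

static Buf pool[NBUFFERS];
static int hand;

static void
on_access(Buf *b)
{
    b->usage_count = ACTIVE;        /* promote straight to "recently used" */
}

/*
 * Background sweeper: demote active -> inactive -> evictable, stopping
 * once enough evictable buffers exist.  Writing out dirty buffers and
 * pushing victims onto a free list is elided here.
 */
static int
background_sweep(int target_free)
{
    int freed = 0, steps = 0;

    while (freed < target_free && steps++ < 2 * NBUFFERS * ACTIVE)
    {
        Buf *b = &pool[hand];

        hand = (hand + 1) % NBUFFERS;
        if (b->usage_count > 0)
            b->usage_count--;
        else
            freed++;
    }
    return freed;
}

int
main(void)
{
    for (int i = 0; i < 4; i++)
        on_access(&pool[i]);        /* four hot buffers */
    printf("freed %d buffers\n", background_sweep(8));
    return 0;
}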

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Peter Geoghegan
On Tue, Apr 14, 2015 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote:
 So, I was thinking about this a little bit more today, prodded by my
 coworker John Gorman.  I'm wondering if we could drive this off of the
 clock sweep; that is, every time the clock sweep touches a buffer, its
 usage count goes down by one, but we also set two flag bits.  Every
 time the buffer gets touched thereafter, we check whether any flag
 bits are set; if so, we clear one and increase the usage count, else
 we do nothing.  So the usage count can increase at most twice per
 clock sweep.  The advantage of that is that, as with Peter's approach,
 it is harder for the usage count of a buffer to max out - to get
 there, you need sustained access over a longer period of time.  But
 it's not time-dependent, so replaying the same workload at twice the
 speed or half the speed on faster or slower hardware doesn't change
 the choice of which buffer to evict, which seems good.

Why is that good?

My little prototype basically implemented the LRU-K approach to
preventing correlated references (which tpc-b has some of, in a really
obvious way - not sure how significant that is). This would have
accidentally made it weight frequency better, but the prototype did
not pretend to actually implement something like LRU-K (which is not
the state of the art in any case), because it did not consider the
recency of the second-to-last access of the buffer at all.

The prototype's performance started off well, but regressed in later
pgbench-collector iterations (due to the influence of VACUUM, I
think). If I wanted to cheat, I could have only done one large 45
minute run, which would have made the prototype look much better
still.

 And it will
 cause us to prefer buffers which are accessed repeatedly over a period
 of time rather than buffers that are accessed a bunch of times in
 quick succession and then not touched again for a while, which seems
 like a good bet.

I think that people were all too quick to dismiss the idea of a wall
time interval playing some role here (at least as a defense against
correlated references, as a correlated reference period). I suppose
that that's because it doesn't fit with an intuition that says that
that kind of interval ought to be derived algebraically - magic delay
settings are considered suspect. However, the same can be said of the
five-minute rule, which is a highly influential rule of thumb that
Jim Gray came up with. The five-minute rule is now obsolete, but that
took a long time, and the fundamental observation still applies
(Wikipedia says it depends on what type of disks you have these days,
but the fact remains that rules of thumb like this can be more or less
robust).

I think that correlated reference type delays *are* used effectively
in real world systems without being fragile or overly workload
dependent.

 I can't convince myself that this fully solves the (currently
 existing) problem of usage counts increasing too fast.  In theory,
 every time the clock sweep hand advances, it decrements the usage
 count of one buffer by one, but also makes it possible for the usage
 count of that buffer to increase by up to two (or by only one if the
 buffer previously had a usage count of five).  So with the right
 access pattern, it seems like you could still get to a place where a
 typical buffer eviction has to do a lot of scanning before it finds a
 victim buffer.  As with the present algorithm, we'd be most likely to
 have that problem when the buffer hit ratio is very high, so that
 there are many opportunities for usage counts to go up between each
 opportunity to push usage counts back down.  But it seems like it
 would at least limit the damage.

 I haven't tried this or anything, so this is just random brainstorming.

That's what it'll take, I think -- random brainstorming, and crude
prototypes to test theories.

As long as we're doing random brainstorming, I'd suggest looking at
making clocksweep actually approximate LRU-K/LRU-2 (which, again, to
be clear, my prototype did not do). The clocksweep could maintain
statistics about the recency of the second-to-last access across all
buffers, and discriminate against buffers according to what bucket of
the population they fit in to. Not sure how aggressively we'd penalize
those buffers that had very old penultimate references (or credit
those that had very recent penultimate references), or what the bucket
partitioning scheme is, but that's probably where I'd take it next.
For example, buffers with a penultimate reference that is more than a
standard deviation below the mean would be double penalized (and maybe
the opposite, for those buffers with penultimate accesses a stddev
above the mean). If that didn't work so well, then I'd look into an
ARC style recency and frequency list (while remembering things about
already evicted blocks, which LRU-K does not do, although that paper
is from the early 1990s).

There are approaches to relieving lock 

Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Robert Haas
On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 Anyways, I'm still curious if you can post similar numbers basing the
 throttling on gross allocation counts instead of time.  Meaning: some
 number of buffer allocations has to have occurred before you consider
 eviction.  Besides being faster I think it's a better implementation:
 an intermittently loaded server will give more consistent behavior.

 Yeah --- I think wall-clock-based throttling is fundamentally the wrong
 thing anyway.  Are we going to start needing a CPU speed measurement to
 tune the algorithm with?  Not the place to be going.  But driving it off
 the number of allocations that've been done could be sensible.  (OTOH,
 that means you need a central counter, which itself would be a
 bottleneck.)

So, I was thinking about this a little bit more today, prodded by my
coworker John Gorman.  I'm wondering if we could drive this off of the
clock sweep; that is, every time the clock sweep touches a buffer, its
usage count goes down by one, but we also set two flag bits.  Every
time the buffer gets touched thereafter, we check whether any flag
bits are set; if so, we clear one and increase the usage count, else
we do nothing.  So the usage count can increase at most twice per
clock sweep.  The advantage of that is that, as with Peter's approach,
it is harder for the usage count of a buffer to max out - to get
there, you need sustained access over a longer period of time.  But
it's not time-dependent, so replaying the same workload at twice the
speed or half the speed on faster or slower hardware doesn't change
the choice of which buffer to evict, which seems good.  And it will
cause us to prefer buffers which are accessed repeatedly over a period
of time rather than buffers that are accessed a bunch of times in
quick succession and then not touched again for a while, which seems
like a good bet.
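
[Editor's note: a rough sketch of the scheme just described, with made-up names rather than the actual PostgreSQL buffer manager. Each sweep tick decrements the usage count and sets two "flag bits", modeled here as touch credits; later accesses each consume a credit to bump the count, so the count can rise by at most two per sweep.]

#include <stdint.h>
#include <stdio.h>

#define NBUFFERS        16
#define MAX_USAGE_COUNT 5

typedef struct
{
    uint8_t usage_count;   /* eviction priority, 0..MAX_USAGE_COUNT */
    uint8_t touch_credits; /* the two "flag bits": set by the sweep */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int clock_hand;

/* Called whenever a backend touches a buffer. */
static void
buffer_touched(BufDesc *buf)
{
    if (buf->touch_credits > 0)
    {
        buf->touch_credits--;                /* clear one flag bit */
        if (buf->usage_count < MAX_USAGE_COUNT)
            buf->usage_count++;
    }
    /* else: both credits spent since the last sweep; do nothing */
}

/* One tick of the clock sweep; returns a victim if one is found. */
static BufDesc *
clock_sweep_tick(void)
{
    BufDesc *buf = &buffers[clock_hand];

    clock_hand = (clock_hand + 1) % NBUFFERS;
    if (buf->usage_count == 0)
        return buf;                          /* evictable */
    buf->usage_count--;
    buf->touch_credits = 2;                  /* at most two bumps per pass */
    return NULL;
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        buffers[i].usage_count = 3;
    for (int i = 0; i < NBUFFERS; i++)
        (void) clock_sweep_tick();           /* counts drop to 2, credits set */

    for (int i = 0; i < 5; i++)
        buffer_touched(&buffers[0]);         /* burst of five touches... */
    printf("usage_count = %u\n", buffers[0].usage_count);  /* ...reaches only 4 */
    return 0;
}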

I can't convince myself that this fully solves the (currently
existing) problem of usage counts increasing too fast.  In theory,
every time the clock sweep hand advances, it decrements the usage
count of one buffer by one, but also makes it possible for the usage
count of that buffer to increase by up to two (or by only one if the
buffer previously had a usage count of five).  So with the right
access pattern, it seems like you could still get to a place where a
typical buffer eviction has to do a lot of scanning before it finds a
victim buffer.  As with the present algorithm, we'd be most likely to
have that problem when the buffer hit ratio is very high, so that
there are many opportunities for usage counts to go up between each
opportunity to push usage counts back down.  But it seems like it
would at least limit the damage.

I haven't tried this or anything, so this is just random brainstorming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-14 Thread Greg Stark
I've been meaning to write this since PGConf and now isn't a great time
since I'm on my phone but I think it's time.

The way the clock sweep algorithm is meant to be thought about is that it's
an approximate lru. Each usage count corresponds to an ntile of the lru. So
we don't know which buffer is least recently used but it must be in the set
of buffers with usage count 0 and that should be 1/nth of all the buffers.

In order for that property to be maintained though the usage count for some
buffer should be getting decremented every time we touch a buffer. That is,
every time we promote one buffer to the most recently used ntile we should
be demoting some other buffer.

The way our cache works we promote when a buffer is accessed but we only
demote when a buffer is flushed. We flush a lot less often than we touch
buffers so it's not surprising that the cache ends up full of buffers that
are all in the most recently used section.

Now it's complicated by the fact that we aren't promoting buffers directly
to the most recently used ntile. We're incrementing the usage count by one.
That makes it more of a least frequently used list rather than an LRU. I
think that's a mistake but I recall some debate about that when it first
went in.


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-05-01 Thread Kevin Grittner
Jim Nasby j...@nasby.net wrote:

 In our case this could maybe be handled by simply not
 incrementing counts when there's no eviction... but I'm more a
 fan of separate pools/clocks, because that means you can do
 things like an LFU for active and an LRU for inactive.

I have hesitated to mention some benchmarks I did for optimal
caching techniques for a database load, because they were so old,
but maybe the ideas might spark something of value in the
discussion.  I'm talking about early 1985 on 80286 hardware on DOS
with a Terminate and Stay Resident (TSR) cache external to the
database.  The external cache used LRU caching, and I was looking
at what caching I could do inside the database to improve real
database workloads which tended to include both OLTP and reporting.

I found two types of caches improved performance.  Neither was a
substitute for the LRU cache closer to the hardware, and
eliminating either reduced performance over having both.  One was
index-specific -- each connection caused to be held in cache the
last page at each level of the index.  This proved useful because
in our real life applications it turned out that the next random
access on an index was very often the same or near the previous. 
The other was a weighted average of access counts -- each access
bumped a count and after a certain number of bumps all counts were
reduced by 25%.  This was accomplished by setting each count to the
sum of its existing value shifted right by one and shifted right
by two.
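
[Editor's note: that decay step is cheap enough to show in two shifts and an add. A sketch under assumptions — the bump threshold below is invented, since the original value isn't recalled in the message above.]

#include <stdint.h>
#include <stdio.h>

#define NPAGES          8
#define BUMPS_PER_DECAY 16   /* invented trigger; the real value is not recalled */

static uint8_t  counts[NPAGES];   /* one byte per page, 0..255 */
static unsigned bumps;

static void
on_access(int page)
{
    if (counts[page] < 255)
        counts[page]++;

    if (++bumps >= BUMPS_PER_DECAY)
    {
        bumps = 0;
        /* count = count * 0.75, via (count >> 1) + (count >> 2) */
        for (int i = 0; i < NPAGES; i++)
            counts[i] = (uint8_t) ((counts[i] >> 1) + (counts[i] >> 2));
    }
}

int
main(void)
{
    for (int i = 0; i < 64; i++)
        on_access(i % 2);             /* pages 0 and 1 take all the traffic */
    printf("page0=%u page1=%u page7=%u\n", counts[0], counts[1], counts[7]);
    return 0;
}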

I understand that with the much larger RAM caches available 30
years later there could be some problems with passing all the
counts atomically without causing a noticeable pause, and the
higher connection counts may cause more contention issues.  But if
those issues could be solved (or somehow dodged for a proof of
concept benchmark) it might be interesting to see how that worked
out.

FWIW, I recall that we used a one byte counter for each page,
running 0 to 255.  I don't recall the number at which we
effectively multiplied by 0.75, and there was nothing particularly
magic about that multiplier other than it was pretty fast and
worked better than 0.5 in my benchmarks.  I also don't remember
what we used as the initial value on a page load.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-05-01 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:

 each connection caused to be held in cache the last page at each
 level of the index.

Apologies for ambiguous terminology there.

To be clear: the most recently accessed page at each level of the index.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-28 Thread Robert Haas
On Fri, Apr 18, 2014 at 11:46 AM, Greg Stark st...@mit.edu wrote:
 On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas robertmh...@gmail.com wrote:
 I am a bit confused by this remark.  In *any* circumstance when you
 evict you're incurring precisely one page fault I/O when the page is
 read back in.   That doesn't mean that the choice of which page to
 evict is irrelevant.

 But you might be evicting a page that will be needed soon or one that
 won't be needed for a while. If it's not needed for a while you might
 be able to avoid many page evictions by caching a page that will be
 used several times.

Sure.

 If all the pages currently in RAM are hot -- meaning they're hot
 enough that they'll be needed again before the page you're reading in
 -- then they're all equally bad to evict.

Also true.  But the problem is that it is very rarely, if ever, the
case that all pages are *equally* hot.  On a pgbench workload, for
example, I'm very confident that while there's not really any cold
data, the btree roots and visibility map pages are a whole lot hotter
than a randomly-selected heap page.  If you evict a heap page, you're
going to need it back pretty quick, because it won't be long until the
random-number generator again chooses a key that happens to be located
on that page.  But if you evict the root of the btree index, you're
going to need it back *immediately*, because the very next query, no
matter what key it's looking for, is going to need that page.  I'm
pretty sure that's a significant difference.

 I'm trying to push us away from the gut instinct that frequently used
 pages are important to cache and towards actually counting how many
 i/os we're saving. In the extreme it's possible to simulate any cache
 algorithm on a recorded list of page requests and count how many page
 misses it generates to compare it with an optimal cache algorithm.

There's another issue, which Simon clued me into a few years back:
evicting the wrong page can cause system-wide stalls.  In the pgbench
case, evicting a heap page will force the next process that chooses a
random number that maps to a tuple on that page to wait for the page
to be faulted back in.  That's sad, but unless the scale factor is
small compared to the number of backends, there will probably be only
ONE process waiting.  On the other hand, if we evict the btree root,
within a fraction of a second, EVERY process that isn't already
waiting on some other I/O will be waiting for that I/O to complete.
The impact on throughput is much bigger in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-28 Thread Robert Haas
On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby j...@nasby.net wrote:
 I feel that if there is no memory pressure, frankly it doesn't matter much
 about what gets out and what not. The case I am specifically targeting is
 when the clocksweep gets to move about a lot i.e. high memory pressure
 workloads. Of course,  I may be totally wrong here.

 Well, there's either memory pressure or there isn't. If there isn't then
 it's all moot *because we're not evicting anything*.

I don't think that's really true.  A workload can fit within
shared_buffers at some times and spill beyond it at others.  Every
time it fits within shared_buffers for even a short period of time,
the reference count of any buffer that's not ice-cold goes to 5 and we
essentially lose all knowledge of which buffers are relatively hotter.
 Then, when we spill out again, evictions are random.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-28 Thread Peter Geoghegan
On Mon, Apr 28, 2014 at 6:02 AM, Robert Haas robertmh...@gmail.com wrote:
 Also true.  But the problem is that it is very rarely, if ever, the
 case that all pages are *equally* hot.  On a pgbench workload, for
 example, I'm very confident that while there's not really any cold
 data, the btree roots and visibility map pages are a whole lot hotter
 than a randomly-selected heap page.  If you evict a heap page, you're
 going to need it back pretty quick, because it won't be long until the
 random-number generator again chooses a key that happens to be located
 on that page.  But if you evict the root of the btree index, you're
 going to need it back *immediately*, because the very next query, no
 matter what key it's looking for, is going to need that page.  I'm
 pretty sure that's a significant difference.

I emphasized leaf pages because even with master the root and inner
pages are still going to be so hot as to make them constantly in
cache, at least with pgbench's use of a uniform distribution. You'd
have to have an absolutely enormous scale factor before this might not
be the case. As such, I'm not all that worried about inner pages when
performing these simple benchmarks. However, in the case of the
pgbench_accounts table, each of the B-Tree leaf pages that comprise
about 99.5% of the total is still going to be about six times more
frequently accessed than each heap page. That's a small enough
difference for it to easily go unappreciated, and yet a big enough
difference for it to hurt a lot.


-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-28 Thread Peter Geoghegan
On Fri, Apr 25, 2014 at 10:45 AM, Peter Geoghegan p...@heroku.com wrote:
 I've now done a non-limited comparative benchmark of master against
 the patch (once again, with usage_count starting at 6, and
 BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again,
 the distribution threshold used was consistently 5.0, causing the
 patched pgbench to report for each test:

 transaction type: Custom query
 scaling factor: 5000
 standard deviation threshold: 5.0
 access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741

 Results are available from:

 http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/

I updated this with various changes in bgwriter configuration. Perhaps
unsurprisingly, disabling the background writer entirely helps for
both master and patched. It is perhaps notable that the largest
difference between comparable patched and master test runs is seen
when the background writer is disabled entirely and there are 32 clients.


-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-28 Thread Jim Nasby

On 4/28/14, 8:04 AM, Robert Haas wrote:

On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby j...@nasby.net wrote:

I feel that if there is no memory pressure, frankly it doesn't matter much
about what gets out and what not. The case I am specifically targeting is
when the clocksweep gets to move about a lot i.e. high memory pressure
workloads. Of course,  I may be totally wrong here.


Well, there's either memory pressure or there isn't. If there isn't then
it's all moot *because we're not evicting anything*.


I don't think that's really true.  A workload can fit within
shared_buffers at some times and spill beyond it at others.  Every
time it fits within shared_buffers for even a short period of time,
the reference count of any buffer that's not ice-cold goes to 5 and we
essentially lose all knowledge of which buffers are relatively hotter.
  Then, when we spill out again, evictions are random.


That's a separate problem, but yes, just because we're not evicting something 
doesn't mean we can't end up with every buffer marked as equally important.

OSes handle this by splitting pages between active and inactive, and 
maintaining a relative balance between the two (actually a bit more complex 
because there's a separate inactive/dirty pool).

In our case this could maybe be handled by simply not incrementing counts when 
there's no eviction... but I'm more a fan of separate pools/clocks, because 
that means you can do things like an LFU for active and an LRU for inactive.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-25 Thread Peter Geoghegan
I've now done a non-limited comparative benchmark of master against
the patch (once again, with usage_count starting at 6, and
BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again,
the distribution threshold used was consistently 5.0, causing the
patched pgbench to report for each test:

transaction type: Custom query
scaling factor: 5000
standard deviation threshold: 5.0
access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741

Results are available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-24 Thread Peter Geoghegan
On Mon, Apr 21, 2014 at 11:57 PM, Peter Geoghegan p...@heroku.com wrote:
 Here is a benchmark that is similar to my earlier one, but with a rate
 limit of 125 tps, to help us better characterize how the prototype
 patch helps performance:

 http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/

I've added some more test sets to this result report, again with a 125
TPS limit, but on this occasion with a pgbench Gaussian distribution.
I used V13 of the recently proposed Gaussian distribution pgbench
patch [1] to accomplish this, including the Gaussian variant of tpc-b
that the pgbench patch has baked in. The distribution threshold used
was consistently 5, causing the patched pgbench to report for each
test:

transaction type: Custom query
scaling factor: 5000
standard deviation threshold: 5.0
access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741

It looks like the patch continues to have much lower latency than
master for this somewhat distinct workload. Actually, even though the
background writer is somewhat working harder than in the uniform
distribution case, the average latency with patched is appreciably
lower. Total buffers allocated are just as consistent as before for
patched, but the number is markedly lower than for the prior uniform
distribution case. Dirty memory graphs start off similar to the
uniform case with patched, but get a bit spikier towards the end of
each test run there. It's still *markedly* better than master for
either distribution type, which is still really aggressive at times
for master, and other times by far isn't aggressive enough, in much
the same way as before.

In general, with the Gaussian distribution, average latency is lower,
but worst case is higher. The patch maintains its clear lead for
average case, albeit a smaller lead than with uniform, and with worst
case things are much better relatively speaking. Absolute worst case
(and not worst case averaged across client counts) is 1.4 seconds with
patched, to 8.3 with master...and that terrible worst case happens
*twice* with master. For uniform distribution, the same figure was 5.4
- 5.8 seconds for master, and 0.6 seconds for patched.

What is curious is that with master and with the Gaussian
distribution, I see a distinct latency no-man's-land in multiple test
runs, like this one here for example:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/49/index.html
It looks like there is a clear differentiation between going to disk
and not going to disk, or something like that. I don't see this for
any other case, and it's quite obviously a consistent and distinct
feature of master + Gaussian when the OS isn't aggressively writing
out a mountain of dirty memory. This is something that I personally
have never seen before.

I also note that master had 3 huge background writer spikes with a
Gaussian distribution, rather than 2 and 1 small one, as was
(consistently) demonstrated to happen with a uniform distribution.
What's more, 90th percentile latency is very consistent across client
counts for the new patched test run, as opposed to being very much
higher with higher client counts when master is tested.

[1] http://www.postgresql.org/message-id/alpine.DEB.2.10.1404011107220.2557@sto
-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-22 Thread Peter Geoghegan
Here is a benchmark that is similar to my earlier one, but with a rate
limit of 125 tps, to help us better characterize how the prototype
patch helps performance:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/

Again, these are 15 minute runs with unlogged tables at multiple
client counts, and scale 5,000.

Every test run should have managed to hit that limit, but in the case
of master 1 test did not. I have included vmstat, iostat and meminfo
OS instrumentation this time around, which is really interesting for
this particular limit based benchmark. The prototype patch tested here
is a slight refinement on my earlier prototype. Apart from reducing
the number of gettimeofday() calls, and doing them out of the critical
path, I increased the initial usage_count value to 6. I also set
BM_MAX_USAGE_COUNT to 30. I guess I couldn't resist the temptation to
tweak things, which I actually did very little of prior to publishing
my initial results. This helps, but there isn't a huge additional
benefit.

The benchmark results show that master cannot even meet the 125 tps
limit on 1 test out of 9. More interestingly, the background writer
consistently cleans about 10,000 buffers per test run when testing the
patch. At the same time, buffers are allocated at a very consistent
rate of around 262,000 for the patched test runs. Leaving aside the
first test run, with the patch there is only a tiny variance in the
number cleaned between each test, a variance of just a few hundred
buffers. In contrast, master has enormous variance. During just over
half of the tests, the background writer does not clean even a single
buffer. Then, on 2 tests out of 9, it cleans an enormous ~350,000
buffers. The second time this happens leads to master failing to even
meet the 125 tps limit (albeit with only one client).

If you drill down to individual test runs, a similar pattern is
evident. You'll now find operating system information (meminfo dirty
memory) graphed here. The majority of the time, master does not hold
more than 1,000 kB of dirty memory at a time. Once or twice it's 0 kB
for multiple minutes. However, during the test runs where we also see
huge spikes in background-writer-cleaned pages, we also see huge
spikes in the total amount of dirty memory (and correlated huge spikes
in latency). It can get to highs of ~700,000 kB at one point. In
contrast, the patched tests show very consistent amounts of dirty
memory. Per test, it almost always tops out at 4,000 kB - 6,000 kB
(there is a single 12,000 kB spike, though). There is consistently a
distinct zig-zag pattern to the dirty memory graph with the patched
tests, regardless of client count or where checkpoints occur. Master
shows mountains and valleys for those two particularly problematic
tests, correlating with a panicked background writer's aggressive
feedback loop. Master also shows less rhythmic zig-zag patterns that
only peak at about 600 kB - 1,000 kB for the entire duration of many
individual test runs.

Perhaps most notably, average and worst case latency is far improved
with the patch. On average it's less than half of master with 1
client, and less than a quarter of master with 32 clients.

I think that the rate limiting feature of pgbench is really useful for
characterizing how work like this improves performance. I see a far
smoother and more consistent pattern of I/O that superficially looks
like Postgres is cooperating with the operating system much more than
it does in the baseline. It sort of makes sense that the operating
system cache doesn't care about frequency while Postgres does. If the
OS cache did weigh frequency, it would surely not alter the outcome
very much, since OS cached data has presumably not been accessed very
frequently recently. I suspect that I've cut down on double buffering
by quite a bit. I would like to come up with a simple way of measuring
that, using something like pgfincore, but the available interfaces
don't seem well-suited to quantifying how much of a problem this is
and remains. I guess calling pgfincore on the index segment files might
be interesting, since shared_buffers mostly holds index pages. This
has been verified using pg_buffercache.
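For what it's worth, the mechanism pgfincore builds on is mincore(), so a
minimal standalone sketch of counting OS-cached pages for one segment file
could look like the following (this is not pgfincore's actual interface,
just an illustration):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count how many pages of one relation segment file are OS-cached. */
static long
resident_pages(const char *path)
{
    struct stat st;
    long        pagesz = sysconf(_SC_PAGESIZE);
    long        npages, resident = 0;
    unsigned char *vec;
    void       *map;
    int         fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return -1;
    npages = (st.st_size + pagesz - 1) / pagesz;
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    vec = malloc(npages);
    if (map != MAP_FAILED && vec != NULL &&
        mincore(map, st.st_size, vec) == 0)
    {
        for (long i = 0; i < npages; i++)
            resident += vec[i] & 1;     /* low bit set = page resident */
    }
    else
        resident = -1;
    free(vec);
    if (map != MAP_FAILED)
        munmap(map, st.st_size);
    close(fd);
    return resident;
}

Comparing that count for an index's segment files against what
pg_buffercache reports for the same relation would give a crude
double-buffering estimate.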

It would be great to test this out with something involving
non-uniform distributions, like Gaussian and Zipfian distributions.
The LRU-K paper tests Zipfian too. The uniform distribution pgbench
uses here, while interesting, doesn't tell the full story at all and
is less representative of reality (TPC-C is formally required to have
a non-uniform distribution [1] for some things, for example). A
Gaussian distribution might show essentially the same failure to
properly credit pages with frequency of access in one additional
dimension, so to speak. Someone should probably look at the TPC-C-like
DBT-2, since in the past that was considered to be a problematic
workload for PostgreSQL [2] due to the heavy I/O. Zipfian is a lot
less sympathetic than uniform if the 

Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-22 Thread Albe Laurenz
Jason Petersen wrote:
 Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday 
 seems silly to me: it was
 something quick for the prototype.
 
 The problem with the clocksweeps is they don’t actually track the progression 
 of “time” within the
 PostgreSQL system.

Would it make sense to just cache the result of the latest gettimeofday() call
and use that as an approximation for wall time?
The busier the system is, the more accurate that should be.
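A minimal sketch of what I mean, with every name invented purely for
illustration (none of this is an existing API):

#include <stdint.h>
#include <sys/time.h>

/* Would live in shared memory; a real version would need an atomic
 * or a lock to avoid torn 64-bit reads on 32-bit platforms. */
static volatile uint64_t cached_now_usec;

/* Refreshed periodically, e.g. once per bgwriter cycle. */
static void
refresh_cached_time(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    cached_now_usec = (uint64_t) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Cheap approximate "now" for buffer-aging decisions: stale by at
 * most one refresh interval, never a syscall in the hot path. */
static uint64_t
approx_now_usec(void)
{
    return cached_now_usec;
}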

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-22 Thread Atri Sharma
On Tue, Apr 22, 2014 at 12:59 PM, Albe Laurenz laurenz.a...@wien.gv.at wrote:

 Jason Petersen wrote:
  Yes, we obviously want a virtual clock. Focusing on the use of
 gettimeofday seems silly to me: it was
  something quick for the prototype.
 
  The problem with the clocksweeps is they don’t actually track the
 progression of “time” within the
  PostgreSQL system.

 Would it make sense to just cache the result of the latest gettimeofday()
 call
 and use that as an approximation for wall time?
 The busier the system is, the more accurate that should be.


That sounds...risky. How will the invalidation/updating of the cache work?

How will we track the time window in which the cached value is still valid
and applicable?

My first thoughts only. I may be missing the point though.

Regards,

Atri



-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-22 Thread Hannu Krosing
On 04/17/2014 10:39 PM, Andres Freund wrote:
 On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote:
 Just over 99.6% of pages (leaving aside the meta page) in the big 10
 GB pgbench_accounts_pkey index are leaf pages.

What is the depth of the b-tree at this percentage?

Cheers
Hannu


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-22 Thread Jim Nasby

On 4/21/14, 6:07 PM, David G Johnston wrote:

Jim Nasby-2 wrote

I feel that if there is no memory pressure, frankly it doesn't matter much
about what gets out and what not. The case I am specifically targeting is
when the clocksweep gets to move about a lot i.e. high memory pressure
workloads. Of course,  I may be totally wrong here.


Well, there's either memory pressure or there isn't. If there isn't then
it's all moot *because we're not evicting anything*.

The trade-off I'm seeing here is between measuring when there is no memory
pressure - and thus eating at performance while not actually evicting
buffers - and not measuring but then encountering memory pressure and not
having a clue as to what should be evicted.


Right. OSes handle this by keeping a certain ratio of active vs inactive pages, 
regardless of pressure for free pages. That way when you need more pages in the 
free list you can pull them from the inactive list knowing that you're making a 
good decision.

One of the really nice things about this approach is that if memory pressure is 
low enough that you don't need more pages on the inactive list you don't even 
need to run that clock.
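As a rough sketch of that arrangement (all names below are invented for
illustration, nothing here is OS or PostgreSQL code):

extern int demote_one_cold_buffer(void);    /* assumed: one clock step */

#define INACTIVE_TARGET_RATIO   0.10        /* assumed tunable */

static void
maintain_inactive_pool(int nbuffers, int ninactive)
{
    int         target = (int) (nbuffers * INACTIVE_TARGET_RATIO);

    /*
     * Under low pressure the inactive pool stays topped up and this
     * loop never runs the clock, which is exactly the point above.
     */
    while (ninactive < target)
        ninactive += demote_one_cold_buffer();
}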
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-22 Thread Peter Geoghegan
On Tue, Apr 22, 2014 at 2:03 AM, Hannu Krosing ha...@krosing.net wrote:
 What is the depth of b-tree at this percentage ?

Well, this percentage of B-Tree pages that are leaf pages doesn't have
much to do with the depth. The percentage seems very consistent for
each B-Tree, irrespective of the total size of the B-Tree.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Jim Nasby

On 4/15/14, 1:15 PM, Peter Geoghegan wrote:

On Tue, Apr 15, 2014 at 9:30 AM, Merlin Moncure mmonc...@gmail.com wrote:

There are many reports of improvement from lowering shared_buffers.
The problem is that it tends to show up on complex production
workloads and that there is no clear evidence pointing to problems
with the clock sweep; it could be higher up in the partition locks or
something else entirely (like the O/S).  pgbench is also not the
greatest tool for sniffing out these cases: it's too random and for
large database optimization is generally an exercise in de-randomizing
i/o patterns.  We really, really need a broader testing suite that
covers more usage patterns.

I find it quite dissatisfying that we know so little about this.


This is an area where additional stats gathering would be very valuable. We're 
running 8G buffers on 512G servers because bumping it up hurt performance, but 
we have no clue why. Was it due to buffer pins? How many times the clock had to 
sweep to find a victim? Something else entirely? No idea... :(
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Jim Nasby

On 4/18/14, 2:51 PM, Atri Sharma wrote:


I feel that if there is no memory pressure, frankly it doesn't matter much about
what gets out and what not. The case I am specifically targeting is when the 
clocksweep gets to move about a lot i.e. high memory pressure workloads. Of 
course,  I may be totally wrong here.


Well, there's either memory pressure or there isn't. If there isn't then it's 
all moot *because we're not evicting anything*.


One thing that I discussed with Merlin offline and am now concerned about is 
how will the actual eviction work. We cannot traverse the entire list and then 
find all the buffers with refcount 0 and then do another traversal to find the 
oldest one.


This is why OSes use multiple page pools. If we're going to use a clock sweep 
at all I think we need to use the same.

Every time we discuss this stuff it feels like we're completely reinventing the 
wheel that was solved by OSes years ago. :(
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Jim Nasby

On 4/16/14, 10:28 AM, Robert Haas wrote:

Also, I think the scalability problems around buffer eviction are
eminently solvable, and in particular I'm hopeful that Amit is going
to succeed in solving them.  Suppose we have a background process
(whether the background writer or some other) that runs the clock
sweep, identifies good candidates for eviction, and pushes them on a
set of, say, 16 free-lists protected by spinlocks.  (The optimal
number of free-lists probably depends on the size of shared_buffers.)


How *certain* are we that a single freelist lock (that actually ONLY protects 
the freelist) would be that big a deal? I suspect it wouldn't be much of an 
issue at all:

- Right now (IIRC) it's tied into the clock as well, so immediate fail on 
scaling...
- The clock is WAY more expensive than grabbing one buffer off the free list. 
Last I looked it was so bad that even if the next buffer the clock hit was free 
it was still worse than hitting the free list.

I strongly suspect that a single freelist lock (that didn't protect anything 
else) would be fine. I think it'd be folly to start with a more complex 
multi-lock/multi-freelist implementation before we knew we needed one.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread David G Johnston
Jim Nasby-2 wrote
 I feel that if there is no memory pressure, frankly it doesn't matter much
 about what gets out and what not. The case I am specifically targeting is
 when the clocksweep gets to move about a lot i.e. high memory pressure
 workloads. Of course,  I may be totally wrong here.
 
 Well, there's either memory pressure or there isn't. If there isn't then
 it's all moot *because we're not evicting anything*.

The trade-off I'm seeing here is between measuring when there is no memory
pressure - and thus eating at performance while not actually evicting
buffers - and not measuring but then encountering memory pressure and not
having a clue as to what should be evicted.

David J.








-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Claudio Freire
On Mon, Apr 21, 2014 at 8:07 PM, David G Johnston
david.g.johns...@gmail.com wrote:
 Jim Nasby-2 wrote
 I feel that if there is no memory pressure, frankly it doesn't matter much
 about what gets out and what not. The case I am specifically targeting is
 when the clocksweep gets to move about a lot i.e. high memory pressure
 workloads. Of course,  I may be totally wrong here.

 Well, there's either memory pressure or there isn't. If there isn't then
 it's all moot *because we're not evicting anything*.

 The trade-off I'm seeing here is between measuring when there is no memory
 pressure - and thus eating at performance while not actually evicting
 buffers - and not measuring but then encountering memory pressure and not
 having a clue as to what should be evicted.


I believe that for the intended use discussed in this thread, a
compile-time switch would be more than enough control, and it would
avoid that tradeoff.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Tom Lane
Jim Nasby j...@nasby.net writes:
 How *certain* are we that a single freelist lock (that actually ONLY
 protects the freelist) would be that big a deal?

We used to have one.  It was a big bottleneck --- and this was years
ago, when the buffer manager was much less scalable than it is today.
(IIRC, getting rid of a central lock was one of the main advantages
of the current clock sweep code over its predecessor.)

The real issue here is that in the modern code, we hardly ever actually
have anything in the freelist: only when a relation is dropped, or
something like that, do buffers ever get put to the freelist.  So your
argument that removing a buffer from the freelist is cheaper than running
the clock sweep is largely missing the point.  We'd have to run a clock
sweep in order to find something to put in the freelist.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Peter Geoghegan
On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 We used to have one.  It was a big bottleneck --- and this was years
 ago, when the buffer manager was much less scalable than it is today.
 (IIRC, getting rid of a central lock was one of the main advantages
 of the current clock sweep code over its predecessor.)

Yes, it was. This is a major advantage of clock sweep, and anything
that replaces it will need to maintain the same advantage. Didn't
someone indicate that clock sweep could beat ARC around that time,
presumably for this reason? If no one did, then my reading of a
variety of other papers on caching indicates that this is probably the
case.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 We used to have one.  It was a big bottleneck --- and this was years
 ago, when the buffer manager was much less scalable than it is today.
 (IIRC, getting rid of a central lock was one of the main advantages
 of the current clock sweep code over its predecessor.)

 Yes, it was. This is a major advantage of clock sweep, and anything
 that replaces it will need to maintain the same advantage. Didn't
 someone indicate that clock sweep could beat ARC around that time,
 presumably for this reason? If no one did, then my reading of a
 variety of other papers on caching indicates that this is probably the
 case.

ARC *was* the predecessor algorithm.  See commit 5d5087363.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Peter Geoghegan
On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ARC *was* the predecessor algorithm.  See commit 5d5087363.

I believe that the main impetus for replacing ARC with clock sweep
came from patent issues, though. It was a happy coincidence that clock
sweep happened to be better than ARC, but that doesn't mean that ARC
didn't have some clear advantages, even if it wasn't worth it on
balance. LRU-K, and 2Q have roughly the same advantages. I'm
reasonably confident you can have the best of both worlds, or
something closer to it.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ARC *was* the predecessor algorithm.  See commit 5d5087363.

 I believe that the main impetus for replacing ARC with clock sweep
 came from patent issues, though.

That was one issue, but performance gains were a large part of it too,
and the main reason why we picked clock sweep rather than something else.
Did you read the commit message I pointed to?

(See also 4e8af8d27.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Peter Geoghegan
On Mon, Apr 21, 2014 at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Did you read the commit message I pointed to?

Yes.

 (See also 4e8af8d27.)

Oh, I wasn't actually aware of the fact that 2Q made it into the tree.
I thought that the first commit message you referred to just
referenced on-list discussion of 2Q. Interesting.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-21 Thread Peter Geoghegan
On Mon, Apr 21, 2014 at 5:59 PM, Peter Geoghegan p...@heroku.com wrote:
 LRU-K, and 2Q have roughly the same advantages. I'm
 reasonably confident you can have the best of both worlds, or
 something closer to it.

Having said that, a big part of what I'd like to accomplish here is to
address the more general problem of correlated references. That's
probably something that has independent value.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-19 Thread Atri Sharma
On Sat, Apr 19, 2014 at 3:37 AM, Bruce Momjian br...@momjian.us wrote:


  One thing that I discussed with Merlin offline and am now concerned
 about is
  how will the actual eviction work. We cannot traverse the entire list
 and then
  find all the buffers with refcount 0 and then do another traversal to
 find the
  oldest one.

 I thought if there was memory pressure the clock sweep would run and we
 wouldn't have everything at the max counter access value.


Hmm, I see your point.

With that applicable as well, I feel that the clocksweep counting/logical
clock system shall be useful when deciding between multiple candidates for
eviction. At worst, it can serve to replace the gettimeofday() calls.

One thing I have thought of with ideas and inputs from Joshua Yanowski
offline is that we can probably have a max-heap keyed on the logical
clock age of buffers. Each time the clocksweep sees a buffer whose refcount
has become zero, it will push the buffer into the max-heap. This can be a
new representation of the freelist or an additional data structure.

This still does not solve the problem of the clocksweep having to scan the
entire list, even if the max-heap makes the eviction step itself O(1).
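To make the shape of that concrete, here is a toy max-heap push keyed on
sweeps survived, ignoring the locking and re-pinning problems mentioned
below:

#include <stdint.h>

#define MAX_CANDIDATES 1024         /* toy bound, not NBuffers */

typedef struct EvictCandidate
{
    int         buf_id;
    uint32_t    sweeps_survived;    /* logical-clock age */
} EvictCandidate;

static EvictCandidate heap[MAX_CANDIDATES];
static int  heap_size = 0;

/* Push a refcount-0 buffer; the oldest one bubbles up to heap[0]. */
static void
heap_push(EvictCandidate c)
{
    int         i = heap_size++;

    heap[i] = c;
    while (i > 0 &&
           heap[(i - 1) / 2].sweeps_survived < heap[i].sweeps_survived)
    {
        EvictCandidate tmp = heap[i];

        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

Eviction would then pop heap[0] in O(log n); the unsolved part is keeping
entries valid when a pushed buffer gets pinned again.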

I am working on a PoC patch but am stuck on this point. My current approach
scans the entire shared buffers list to search for candidate buffers.

Another thing that is a pain point here is the concurrency and locking
overheads of introducing a new data structure. Can the existing buffer
header spinlock handle this problem or is it hitting the granularity of the
spinlock too much?

I see some blockers for this idea still. Nevertheless, the idea of
clocksweep counts as logical clocks seems promising, at least
intuitively.

Thoughts and comments?

Regards,

Atri


-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Atri Sharma
On Fri, Apr 18, 2014 at 7:27 AM, Peter Geoghegan p...@heroku.com wrote:

A way I have in mind about eviction policy is to introduce a way to have an
ageing factor in each buffer and take the ageing factor into consideration
when evicting a buffer.

Consider a case where a table is pretty huge and spread across multiple
pages. The querying pattern is like a time series pattern i.e. a set of
rows is queried pretty frequently for some time, making the corresponding
page hot. Then, the next set of rows is queried frequently making that page
hot and so on.

Consider a new page entering the shared buffers with refcount 1 and
usage_count 1. If that page is a part of the workload described above, it
is likely that it shall not be used for a considerable amount of time after
it has entered the buffers but will be used eventually.

Now, the current hypothetical situation is that we have three pages:

1) The page that used to be hot at the previous time window but is no
longer hot and is actually the correct candidate for eviction.
2) The current hot page (It wont be evicted anyway for now).
3) The new page which just got in and should not be evicted since it can be
hot soon (for this workload it will be hot in the next time window).

When the clocksweep algorithm runs the next time, it will see the new buffer
page as the one to be evicted (since page (1) may still have usage_count >
0, i.e. it may be 'cooling' but not 'cool' yet).

This can be changed by introducing an ageing factor that sees how much time
the current buffer has spent in shared buffers. If the time that the buffer
has spent is large enough (relatively) and it is not hot currently, that
means it has had its chance and can be evicted. This shall save the new
page (3) from being evicted since its time in shared buffers shall not be
high enough to mandate eviction and it shall be given more chances.

Since gettimeofday() is an expensive call and hence cannot be done in the
tight loop, we can count the number of clocksweeps the current buffer has
seen (rather, survived). This shall give us a rough idea of the estimate of
the relative age of the buffer.

When an eviction happens, all the candidates with refcount = 0 shall be
taken. Then, among them, the one with the highest ageing factor shall be
evicted.

Of course, there may be better ways of doing the same, but I want to
highlight the point (or possibility) of introducing an ageing factor to
prevent eviction of relatively younger pages early in the eviction process.

The overhead isn't too big. We just need to add another attribute in the
buffer header for the number of clocksweeps seen (rather, survived) and
check it when an eviction is taking place. The existing spinlock for buffer
headers shall be good for protecting against contention on access. The
access rules can be similar to those of usage_count.
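As a rough sketch of the bookkeeping (the names are made up, loosely
echoing the real buffer header layout):

#include <stdint.h>

typedef struct BufferDescSketch
{
    /* ... existing fields: tag, flags, refcount, spinlock ... */
    uint16_t    usage_count;
    uint32_t    sweeps_survived;    /* ageing factor: sweeps seen */
} BufferDescSketch;

/* Done by the sweep for each buffer it passes, under the header lock. */
static void
sweep_visit(BufferDescSketch *buf)
{
    if (buf->usage_count > 0)
        buf->usage_count--;
    buf->sweeps_survived++;         /* relative age in "sweep time" */
}

/* Among refcount-0 candidates, evict the one with the largest
 * sweeps_survived, sparing pages that arrived only recently. */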

Thoughts and comments?

Regards,

Atri

-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Robert Haas
On Thu, Apr 17, 2014 at 5:00 PM, Greg Stark st...@mit.edu wrote:
 On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas robertmh...@gmail.com wrote:
 Because all the usage counts are the same, the eviction at
 this point is completely indiscriminate.  We're just as likely to kick
 out a btree root page or a visibility map page as we are to kick out a
 random heap page, even though the former have probably been accessed
 several orders of magnitude more often.  That's clearly bad.

 That's not clear at all. In that circumstance regardless of what page
 you evict you're incurring precisely one page fault i/o when the page
 is read back in.

I am a bit confused by this remark.  In *any* circumstance when you
evict you're incurring precisely one page fault I/O when the page is
read back in.   That doesn't mean that the choice of which page to
evict is irrelevant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Greg Stark
On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas robertmh...@gmail.com wrote:
 I am a bit confused by this remark.  In *any* circumstance when you
 evict you're incurring precisely one page fault I/O when the page is
 read back in.   That doesn't mean that the choice of which page to
 evict is irrelevant.

But you might be evicting a page that will be needed soon or one that
won't be needed for a while. If it's not needed for a while you might
be able to avoid many page evictions by caching a page that will be
used several times.

If all the pages currently in RAM are hot -- meaning they're hot
enough that they'll be needed again before the page you're reading in
-- then they're all equally bad to evict.

I'm trying to push us away from the gut instinct that frequently used
pages are important to cache and towards actually counting how many
i/os we're saving. In the extreme it's possible to simulate any cache
algorithm on a recorded list of page requests and count how many page
misses it generates to compare it with an optimal cache algorithm.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Bruce Momjian
On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote:
 This can be changed by introducing an ageing factor that sees how much time 
 the
 current buffer has spent in shared buffers. If the time that the buffer has
 spent is large enough (relatively) and it is not hot currently, that means it
 has had its chance and can be evicted. This shall save the new page (3) from
 being evicted since its time in shared buffers shall not be high enough to
 mandate eviction and it shall be given more chances.
 
 Since gettimeofday() is an expensive call and hence cannot be done in the 
 tight
 loop, we can count the number of clocksweeps the current buffer has seen
 (rather, survived). This shall give us a rough idea of the estimate of the
 relative age of the buffer.

Counting clock sweeps is an interesting idea.  I think one concern was
tracking hot buffers in cases where there is no memory pressure, and
hence the clock sweep isn't running --- I am not sure how this would
help in that case.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Atri Sharma
On Sat, Apr 19, 2014 at 1:07 AM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote:
  This can be changed by introducing an ageing factor that sees how much
 time the
  current buffer has spent in shared buffers. If the time that the buffer
 has
  spent is large enough (relatively) and it is not hot currently, that
 means it
  has had its chance and can be evicted. This shall save the new page (3)
 from
  being evicted since its time in shared buffers shall not be high enough
 to
  mandate eviction and it shall be given more chances.
 
  Since gettimeofday() is an expensive call and hence cannot be done in
 the tight
  loop, we can count the number of clocksweeps the current buffer has seen
  (rather, survived). This shall give us a rough idea of the estimate of
 the
  relative age of the buffer.

 Counting clock sweeps is an interesting idea.  I think one concern was
 tracking hot buffers in cases where there is no memory pressure, and
 hence the clock sweep isn't running --- I am not sure how this would
 help in that case.


I feel that if there is no memory pressure, frankly it doesn't matter much
about what gets out and what not. The case I am specifically targeting is
when the clocksweep gets to move about a lot i.e. high memory pressure
workloads. Of course,  I may be totally wrong here.

One thing that I discussed with Merlin offline and am now concerned about
is how will the actual eviction work. We cannot traverse the entire list
and then find all the buffers with refcount 0 and then do another traversal
to find the oldest one.

Any thoughts there would be appreciated.

Regards,

Atri

-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Jason Petersen
On Apr 18, 2014, at 1:51 PM, Atri Sharma atri.j...@gmail.com wrote:

  Counting clock sweeps is an interesting idea.  I think one concern was
 tracking hot buffers in cases where there is no memory pressure, and
 hence the clock sweep isn't running --- I am not sure how this would
 help in that case.
 


Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday 
seems silly to me: it was something quick for the prototype.

The problem with the clocksweeps is they don’t actually track the progression 
of “time” within the PostgreSQL system.

What’s wrong with using a transaction number or some similar sequence? It would 
accurately track “age” in the sense we care about: how long ago in “units of 
real work being done by the DB” something was added.

—Jason



Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Atri Sharma
Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday
 seems silly to me: it was something quick for the prototype.

 The problem with the clocksweeps is they don’t actually track the
 progression of “time” within the PostgreSQL system.

 What’s wrong with using a transaction number or some similar sequence? It
 would accurately track “age” in the sense we care about: how long ago in
 “units of real work being done by the DB” something was added.



Well, AIUI, we only need the 'relative' age of buffers in relation to the
youngest buffer present. So, the guy who has seen the maximum number of
clocksweeps is the guy who has been around the most.

I do not see a need for an accurate estimate of the time spent in the
buffer for any purpose right now. It may be useful in the future though.

How do you get the transaction ID? By accessing a tuple on the page and
reading its XMIN?

Regards,

Atri



-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Peter Geoghegan
On Fri, Apr 18, 2014 at 1:11 PM, Jason Petersen ja...@citusdata.com wrote:
 Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday
 seems silly to me: it was something quick for the prototype.

The gettimeofday() call doesn't need to happen in a tight loop. It can
be reasonably passed down from higher up, very probably without
consequence. The LRU-K paper actually recommends a delay of 5 seconds.
There is at least one other major database system that unambiguously
uses a wall-clock delay of 3 seconds (by default) for this exact
purpose - avoiding what the LRU-K paper calls correlated references.

I'm not saying that we should go with that particular scheme for these
reasons. However, it's plainly untrue that using wall time like this
represents some kind of insurmountable scalability obstacle.
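For illustration, the correlated-reference rule boils down to something
like this, with the constant and names being mine and the timestamp passed
down from the caller as described:

#include <stdint.h>

/* ~3 second window, as in the system mentioned above; illustrative. */
#define CORRELATED_REF_USEC     (3 * 1000000ULL)

typedef struct BufSketch
{
    uint64_t    last_ref_usec;      /* last uncorrelated reference */
    uint16_t    usage_count;
} BufSketch;

/* now_usec is passed down from higher up, not fetched per access. */
static void
note_access(BufSketch *buf, uint64_t now_usec)
{
    if (now_usec - buf->last_ref_usec >= CORRELATED_REF_USEC)
    {
        buf->usage_count++;         /* a genuinely new reference */
        buf->last_ref_usec = now_usec;
    }
    /* else: a correlated re-reference inside the window; don't bump */
}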

 The problem with the clocksweeps is they don’t actually track the
 progression of “time” within the PostgreSQL system.

 What’s wrong with using a transaction number or some similar sequence? It
 would accurately track “age” in the sense we care about: how long ago in
 “units of real work being done by the DB” something was added.

The LRU-K paper suggests that a scheme like this could be preferable.
I have my doubts that it can be made to work with Postgres.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-18 Thread Bruce Momjian
On Sat, Apr 19, 2014 at 01:21:29AM +0530, Atri Sharma wrote:
 I feel that if there is no memory pressure, frankly it doesn't matter much
 about
 what gets out and what not. The case I am specifically targeting is when the
 clocksweep gets to move about a lot i.e. high memory pressure workloads. Of
 course,  I may be totally wrong here.
 
 One thing that I discussed with Merlin offline and am now concerned about is
 how will the actual eviction work. We cannot traverse the entire list and then
 find all the buffers with refcount 0 and then do another traversal to find the
 oldest one.

I thought if there was memory pressure the clock sweep would run and we
wouldn't have everything at the max counter access value.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Greg Stark
On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas robertmh...@gmail.com wrote:
 This isn't a fundamental property of the usage-count idea; it's an
 artifact of the fact that usage count decreases are tied to eviction
 pressure rather than access pressure.  For example, suppose we made a
 rule that if the total usage counts of all buffers exceed 3 *
 NBuffers, then every time you bump the usage count of a buffer from N
 to N+1, you're required to advance the clock sweep far enough to
 decrease the reference count of a buffer by one.

This sounds like the right way to reason about it.

From what I remember in school the idea with the clock sweep is to set
the usage flags to the maximum whenever the buffer is used and
decrement (actually iirc typically shift right)  it when the clock
sweep goes by. Ie, simulate a LRU where when the buffer is accessed it
jumps to the head of the list and when the clock comes by it moves
gradually down the list.
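In code, that textbook scheme is just (illustrative):

#define USAGE_MAX 0xFF

/* Access: jump to the head of the simulated LRU list. */
static inline void
on_access(unsigned char *usage)
{
    *usage = USAGE_MAX;
}

/* Clock pass: exponential decay toward the tail. */
static inline void
on_sweep_pass(unsigned char *usage)
{
    *usage >>= 1;
}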

What you're pointing out is that the clock might not come by very
often resulting everything being at the head of the list. In that case
I'm not clear it really matters what gets evicted though. And the cpu
effort of running the clock n times sounds bad but doing the work
earlier doesn't really change the amount of work being done, it just
amortizes it over more calls.

But if you want to do that, it seems to me the way to do it is: every
time a buffer is pinned, set it to the maximum and then run the clock
max_value - previous_value steps. So the total usage counts of all buffers
remains constant. If that results in contention one way to reduce it
is to do this probabilistically. Run the clock 1% of the time but run
it 100x as much as you would normally.
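A hypothetical sketch of that accounting, with every helper name assumed
rather than real:

#include <stdlib.h>

extern int  get_usage_count(int buf_id);        /* assumed helpers */
extern void set_usage_count(int buf_id, int count);
extern void run_clock_sweep_steps(int nsteps);  /* each step decrements
                                                 * one usage count */

#define USAGE_MAX 5

static void
pin_and_pay(int buf_id)
{
    int         owed = USAGE_MAX - get_usage_count(buf_id);

    set_usage_count(buf_id, USAGE_MAX);

    /*
     * Probabilistic batching: 1% of the time, do 100x the owed work,
     * so the expected total of all usage counts stays constant.
     */
    if (owed > 0 && rand() % 100 == 0)
        run_clock_sweep_steps(owed * 100);
}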

But I think you've misidentified the problem and what those other
algorithms are trying to solve. The problem is not that Postgres will
pick a bad buffer to evict. If all the buffers have been used since the
last time the clock came around then they're all hot anyways and it
doesn't really matter which one we evict. The problem is that we
expend an inordinate amount of work finding the few non-hot buffers.
When you have a really large amount of memory and 99.9% of it is hot
but 0.1% is whatever random non-hot page was needed last then there's
an obvious buffer to evict when you need a new one. But we spend a lot
of work decrementing every hot buffer's usage count 4 times only to
have them immediately incremented again just to find the 1 buffer
where the usage count was 4 or 3. The goal of these algorithms that
divide the buffers into groups is to avoid having to do so much work
to find the colder buffers. Once the hot buffers migrate to the hot
pool we only need to run the clock there when we find we have new hot
pages that we want to promote. All the thrashing in the cold pool can
be more efficient because there's many fewer pages to consider.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Robert Haas
On Thu, Apr 17, 2014 at 9:40 AM, Greg Stark st...@mit.edu wrote:
 On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas robertmh...@gmail.com wrote:
 This isn't a fundamental property of the usage-count idea; it's an
 artifact of the fact that usage count decreases are tied to eviction
 pressure rather than access pressure.  For example, suppose we made a
 rule that if the total usage counts of all buffers exceed 3 *
 NBuffers, then every time you bump the usage count of a buffer from N
 to N+1, you're required to advance the clock sweep far enough to
 decrease the reference count of a buffer by one.

 This sounds like the right way to reason about it.

 From what I remember in school the idea with the clock sweep is to set
 the usage flags to the maximum whenever the buffer is used and
 decrement (actually iirc typically shift right)  it when the clock
 sweep goes by. Ie, simulate a LRU where when the buffer is accessed it
 jumps to the head of the list and when the clock comes by it moves
 gradually down the list.

 What you're pointing out is that the clock might not come by very
 often resulting everything being at the head of the list. In that case
 I'm not clear it really matters what gets evicted though. And the cpu
 effort of running the clock n times sounds bad but doing the work
 earlier doesn't really change the amount of work being done, it just
 amortizes it over more calls.

 But if you want to do that, it seems to me the way to do it is: every
 time a buffer is pinned, set it to the maximum and then run the clock
 max_value - previous_value steps. So the total usage counts of all buffers
 remains constant. If that results in contention one way to reduce it
 is to do this probabilistically. Run the clock 1% of the time but run
 it 100x as much as you would normally.

 But I think you've misidentified the problem and what those other
 algorithms are trying to solve. The problem is not that Postgres will
 pick a bad buffer to evict. If all the buffers have been used since the
 last time the clock came around then they're all hot anyways and it
 doesn't really matter which one we evict. The problem is that we
 expend an inordinate amount of work finding the few non-hot buffers.
 When you have a really large amount of memory and 99.9% of it is hot
 but 0.1% is whatever random non-hot page was needed last then there's
 an obvious buffer to evict when you need a new one. But we spend a lot
 of work decrementing every hot buffer's usage count 4 times only to
 have them immediately incremented again just to find the 1 buffer
 where the usage count was 4 or 3. The goal of these algorithms that
 divide the buffers into groups is to avoid having to do so much work
 to find the colder buffers. Once the hot buffers migrate to the hot
 pool we only need to run the clock there when we find we have new hot
 pages that we want to promote. All the thrashing in the cold pool can
 be more efficient because there's many fewer pages to consider.

Well, I think Peter has proved that PostgreSQL *will* pick a bad
buffer to evict.  The proof is that when he changed the choice of
buffer to evict, he got a significant performance improvement.

I also believe this to be the case on first principles and my own
experiments.  Suppose you have a workload that fits inside
shared_buffers.  All of the usage counts will converge to 5.  Then,
somebody accesses a table that is not cached, so something's got to be
evicted.  Because all the usage counts are the same, the eviction at
this point is completely indiscriminate.  We're just as likely to kick
out a btree root page or a visibility map page as we are to kick out a
random heap page, even though the former have probably been accessed
several orders of magnitude more often.  That's clearly bad.  On
systems that are not too heavily loaded it doesn't matter too much
because we just fault the page right back in from the OS pagecache.
But I've done pgbench runs where such decisions lead to long stalls,
because the page has to be brought back in from disk, and there's a
long I/O queue; or maybe just because the kernel thinks PostgreSQL is
issuing too many I/O requests and makes some of them wait to cool
things down.

Of course, the overhead of repeated clock sweeps to push down the
usage counts isn't a great thing either.  I'm not saying that isn't a
problem.  But I think bad decisions about what to evict are also a
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Bruce Momjian
On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote:
 I also believe this to be the case on first principles and my own
 experiments.  Suppose you have a workload that fits inside
 shared_buffers.  All of the usage counts will converge to 5.  Then,
 somebody accesses a table that is not cached, so something's got to be
 evicted.  Because all the usage counts are the same, the eviction at
 this point is completely indiscriminate.  We're just as likely to kick
 out a btree root page or a visibility map page as we are to kick out a
 random heap page, even though the former have probably been accessed
 several orders of magnitude more often.  That's clearly bad.  On
 systems that are not too heavily loaded it doesn't matter too much
 because we just fault the page right back in from the OS pagecache.
 But I've done pgbench runs where such decisions lead to long stalls,
 because the page has to be brought back in from disk, and there's a
 long I/O queue; or maybe just because the kernel thinks PostgreSQL is
 issuing too many I/O requests and makes some of them wait to cool
 things down.

I understand now.  If there is no memory pressure, every buffer gets the
max usage count, and when a new buffer comes in, it isn't the max so it
is swiftly removed until the clock sweep has time to decrement the old
buffers.  Decaying buffers when there is no memory pressure creates
additional overhead and gets into timing issues of when to decay.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Robert Haas
On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote:
 I also believe this to be the case on first principles and my own
 experiments.  Suppose you have a workload that fits inside
 shared_buffers.  All of the usage counts will converge to 5.  Then,
 somebody accesses a table that is not cached, so something's got to be
 evicted.  Because all the usage counts are the same, the eviction at
 this point is completely indiscriminate.  We're just as likely to kick
 out a btree root page or a visibility map page as we are to kick out a
 random heap page, even though the former have probably been accessed
 several orders of magnitude more often.  That's clearly bad.  On
 systems that are not too heavily loaded it doesn't matter too much
 because we just fault the page right back in from the OS pagecache.
 But I've done pgbench runs where such decisions lead to long stalls,
 because the page has to be brought back in from disk, and there's a
 long I/O queue; or maybe just because the kernel thinks PostgreSQL is
 issuing too many I/O requests and makes some of them wait to cool
 things down.

 I understand now.  If there is no memory pressure, every buffer gets the
 max usage count, and when a new buffer comes in, it isn't the max so it
 is swiftly removed until the clock sweep has time to decrement the old
 buffers.  Decaying buffers when there is no memory pressure creates
 additional overhead and gets into timing issues of when to decay.

That can happen, but the real problem I was trying to get at is that
when all the buffers get up to max usage count, they all appear
equally important.  But in reality they're not.  So when we do start
evicting those long-resident buffers, it's essentially random which
one we kick out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Bruce Momjian
On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote:
 On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian br...@momjian.us wrote:
  On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote:
  I also believe this to be the case on first principles and my own
  experiments.  Suppose you have a workload that fits inside
  shared_buffers.  All of the usage counts will converge to 5.  Then,
  somebody accesses a table that is not cached, so something's got to be
  evicted.  Because all the usage counts are the same, the eviction at
  this point is completely indiscriminate.  We're just as likely to kick
  out a btree root page or a visibility map page as we are to kick out a
  random heap page, even though the former have probably been accessed
  several orders of magnitude more often.  That's clearly bad.  On
  systems that are not too heavily loaded it doesn't matter too much
  because we just fault the page right back in from the OS pagecache.
  But I've done pgbench runs where such decisions lead to long stalls,
  because the page has to be brought back in from disk, and there's a
  long I/O queue; or maybe just because the kernel thinks PostgreSQL is
  issuing too many I/O requests and makes some of them wait to cool
  things down.
 
  I understand now.  If there is no memory pressure, every buffer gets the
  max usage count, and when a new buffer comes in, it isn't the max so it
  is swiftly removed until the clock sweep has time to decrement the old
  buffers.  Decaying buffers when there is no memory pressure creates
  additional overhead and gets into timing issues of when to decay.
 
 That can happen, but the real problem I was trying to get at is that
 when all the buffers get up to max usage count, they all appear
 equally important.  But in reality they're not.  So when we do start
 evicting those long-resident buffers, it's essentially random which
 one we kick out.

True.  Ideally we would have some way to know that _all_ the buffers had
reached the maximum and kick off a sweep to decrement them all.  I am
unclear how we would do that.  One odd idea would be to have a global
counter that is incremented every time a buffer goes from 4 to 5 (max)
--- when the counter equals 50% of all buffers, do a clock sweep.  Of
course, then the counter becomes a bottleneck.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Robert Haas
On Thu, Apr 17, 2014 at 10:48 AM, Bruce Momjian br...@momjian.us wrote:
  I understand now.  If there is no memory pressure, every buffer gets the
  max usage count, and when a new buffer comes in, it isn't the max so it
  is swiftly removed until the clock sweep has time to decrement the old
  buffers.  Decaying buffers when there is no memory pressure creates
  additional overhead and gets into timing issues of when to decay.

 That can happen, but the real problem I was trying to get at is that
 when all the buffers get up to max usage count, they all appear
 equally important.  But in reality they're not.  So when we do start
 evicting those long-resident buffers, it's essentially random which
 one we kick out.

 True.  Ideally we would have some way to know that _all_ the buffers had
 reached the maximum and kick off a sweep to decrement them all.  I am
 unclear how we would do that.  One odd idea would be to have a global
 counter that is incremented every time a buffer goes from 4 to 5 (max)
 --- when the counter equals 50% of all buffers, do a clock sweep.  Of
 course, then the counter becomes a bottleneck.

Yeah, I think that's the right general line of thinking.  But it
doesn't have to be as coarse-grained as do a whole clock sweep.  It
can be, you know, for every buffer that gets incremented from 4 to 5,
run the clock sweep far enough to decrement the usage count of some
other buffer by one.  That's similar to your idea but you can do it a
bit at a time rather than having to make a complete pass over
shared_buffers all at once.
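Schematically, with the helper name assumed:

extern void advance_sweep_one_decrement(void);  /* assumed: move the
                                                 * clock hand until one
                                                 * usage count drops */

#define USAGE_MAX 5

static void
bump_usage(unsigned *usage_count)
{
    if (*usage_count < USAGE_MAX)
    {
        (*usage_count)++;
        /*
         * Tie decay to access pressure: the 4->5 bump pays for one
         * decrement elsewhere, amortizing the full sweep away.
         */
        if (*usage_count == USAGE_MAX)
            advance_sweep_one_decrement();
    }
}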

Your other point, that the counter can become the bottleneck, is quite
right also and a major problem in this area.  I don't know how to
solve it right at the moment, but I'm hopeful that there may be a way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Andres Freund
On 2014-04-17 10:48:15 -0400, Bruce Momjian wrote:
 On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote:
  That can happen, but the real problem I was trying to get at is that
  when all the buffers get up to max usage count, they all appear
  equally important.  But in reality they're not.  So when we do start
  evicting those long-resident buffers, it's essentially random which
  one we kick out.
 
 True.  Ideally we would have some way to know that _all_ the buffers had
 reached the maximum and kick off a sweep to decrement them all.  I am
 unclear how we would do that.  One odd idea would be to have a global
 counter that is incremented every time a buffer goes from 4 to 5 (max)
 --- when the counter equals 50% of all buffers, do a clock sweep.  Of
 course, then the counter becomes a bottleneck.

I have my doubts that we'll make the current scheme, where buffer
reclaim essentially is O(NBuffers), work much better. Especially as CPU
cache effects make such large, high-frequency accesses really
expensive.
I think we need more drastic changes.

I am *not* suggesting that we do that, but I believe it'd be possible to
implement a full LRU and be faster than today in scenarios with
nontrivial amounts of shared buffers.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Greg Stark
On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas robertmh...@gmail.com wrote:
 Because all the usage counts are the same, the eviction at
 this point is completely indiscriminate.  We're just as likely to kick
 out a btree root page or a visibility map page as we are to kick out a
 random heap page, even though the former have probably been accessed
 several orders of magnitude more often.  That's clearly bad.

That's not clear at all. In that circumstance regardless of what page
you evict you're incurring precisely one page fault i/o when the page
is read back in. Incurring that i/o is bad but it's unavoidable and
it's the same badness regardless of what page it's for. The only way
to prefer one page over another is if one page won't be needed for
long enough for the page to be useful for caching this new buffer (or
mixture of buffers) for multiple accesses. If you can't do that then
it doesn't matter which buffer you use since it'll just be evicted to
read back in the original page again.



-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Greg Stark
On Tue, Apr 15, 2014 at 7:30 PM, Peter Geoghegan p...@heroku.com wrote:
 Frankly, there doesn't need to be any research on this, because it's
 just common sense that probabilistically, leaf pages are much more
 useful than heap pages in servicing index scan queries if we assume a
 uniform distribution. If we don't assume that, then they're still more
 useful on average.

I don't think common sense is compelling. I think you need to pin
down exactly what it is about btree intermediate pages that the LRU
isn't capturing and not just argue they're more useful. The LRU is
already capturing which pages are more heavily used than others so you
need to identify what it is that makes index pages *even more* useful
than their frequency and recency of access indicates. Not just that
they're more useful than an average page.

So what I think is missing is that indexes are always accessed from
the root down to the leaf. So the most recent page accessed will
always be the leaf. And in whatever chain of pages was used to reach
the last leaf page the least recently accessed will always be the
root. But we'll need the root page again on the subsequent descent
even if it's to reach the same leaf page we kept in ram in preference
to it.

Now it doesn't *always* make sense to keep an intermediate page over
leaf pages. Imagine an index that we always do full traversals of.
We'll always descend from the root down the left-most pages and then
follow the right pointers across. All the other intermediate pages
will be cold. If we do an occasional descent probing for other keys
those leaf pages shouldn't be cached since they won't be needed again
for the common full index traversals and the next occasional probe
will probably be looking for different keys.

But if we're often probing for the same keys the last thing we want to
do is throw away one of the intermediate pages for those keys when we
could throw away a leaf page. But that's what would happen in a strict
LRU.  It's almost like what we would really want to do is mark the
pages as least recently used in the opposite order from the order
they're actually accessed when descending. Or perhaps bump the usage
count to max+1 when it's an intermediate page so that it takes one
extra cycle of decrementing before it's considered old compared to a
leaf page.
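
(A sketch of that last idea, with invented names -- a higher usage ceiling for non-leaf index pages, so they need one extra decrement cycle before looking as old as a leaf page:)

#include <stdbool.h>

#define BM_MAX_USAGE_COUNT 5

/* Hypothetical: give B-Tree root/inner pages one extra clock cycle of
 * grace before they look as old as a leaf or heap page. */
static void
bump_usage_count(int *usage_count, bool is_btree_inner_page)
{
    int ceiling = is_btree_inner_page ? BM_MAX_USAGE_COUNT + 1
                                      : BM_MAX_USAGE_COUNT;

    if (*usage_count < ceiling)
        (*usage_count)++;
}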


-- 
greg




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 several orders of magnitude more often.  That's clearly bad.  On
 systems that are not too heavily loaded it doesn't matter too much
 because we just fault the page right back in from the OS pagecache.

Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
long enough, the kernel will happily evict it as 'cold' from *its*
cache, leading to...

 But I've done pgbench runs where such decisions lead to long stalls,
 because the page has to be brought back in from disk, and there's a
 long I/O queue; or maybe just because the kernel thinks PostgreSQL is
 issuing too many I/O requests and makes some of them wait to cool
 things down.

Exactly this.

 Of course, the overhead of repeated clock sweeps to push down the
 usage counts isn't a great thing either.  I'm not saying that isn't a
 problem.  But I think bad decisions about what to evict are also a
 problem.

Using a bit more CPU here and there, particularly if it's done in a
background worker, or ideally multiple background workers (for each
buffer pool) would be much better than evicting a hot page that isn't in
the kernel's buffer either 'cause we've held on to it long enough that
the kernel thinks it's cold.

Thanks,

Stephen




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Greg Stark
On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote:
 Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
 long enough, the kernel will happily evict it as 'cold' from *its*
 cache, leading to...


This is a whole nother problem.

It is worrisome that we could be benchmarking page replacement in
Postgres and choose a page replacement algorithm that performs well
because it tends to evict pages that are in the OS cache. And then one
day (hopefully not too far off) we'll fix the double buffering problem
and end up with a strange choice of page replacement algorithm.

It also means that every benchmark is super sensitive to how large a
fraction of system memory Postgres is managing. A benchmark of a
page replacement algorithm with 3GB shared buffers might perform well
compared to others on a system with 8GB or 32GB total RAM but actually
be choosing pages very poorly in normal terms and perform terribly on
a system with 4GB total RAM.

Ideally what I would like to see is instrumentation of Postgres's
buffer pinning so we can generate various test loads and then just run
the different algorithms on them and measure precisely how many page
evictions it's causing and how often it's choosing pages that
need to be read in soon after and so on. We shouldn't have to run
Postgres to get these counts at all, just run the algorithm as we read
through a text file (or database table) listing the pages being
accessed.
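
(A throwaway simulator in that spirit can be very small. This sketch -- pool size, counters and input format all arbitrary -- reads one page number per line from stdin, runs a plain clock sweep, and reports hits, misses and evictions:)

#include <stdio.h>

#define POOL     8
#define MAXUSAGE 5

int
main(void)
{
    long pages[POOL];
    int  usage[POOL];
    int  hand = 0;
    long page, hits = 0, misses = 0, evictions = 0;

    for (int i = 0; i < POOL; i++)
    {
        pages[i] = -1;              /* empty slot */
        usage[i] = 0;
    }

    while (scanf("%ld", &page) == 1)
    {
        int i;

        for (i = 0; i < POOL; i++)  /* linear lookup; a real test rig */
            if (pages[i] == page)   /* would use a hash table */
                break;

        if (i < POOL)
        {
            hits++;
            if (usage[i] < MAXUSAGE)
                usage[i]++;
            continue;
        }

        misses++;
        for (;;)                    /* the clock sweep itself */
        {
            if (usage[hand] == 0)
                break;
            usage[hand]--;
            hand = (hand + 1) % POOL;
        }
        if (pages[hand] != -1)
            evictions++;
        pages[hand] = page;
        usage[hand] = 1;
        hand = (hand + 1) % POOL;
    }

    printf("hits=%ld misses=%ld evictions=%ld\n", hits, misses, evictions);
    return 0;
}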

-- 
greg




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Peter Geoghegan
On Thu, Apr 17, 2014 at 9:21 AM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 several orders of magnitude more often.  That's clearly bad.  On
 systems that are not too heavily loaded it doesn't matter too much
 because we just fault the page right back in from the OS pagecache.

 Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
 long enough, the kernel will happily evict it as 'cold' from *its*
 cache, leading to...

 But I've done pgbench runs where such decisions lead to long stalls,
 because the page has to be brought back in from disk, and there's a
 long I/O queue; or maybe just because the kernel thinks PostgreSQL is
 issuing too many I/O requests and makes some of them wait to cool
 things down.

 Exactly this.

Yes, I believe that's why this is so effective.


-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
* Greg Stark (st...@mit.edu) wrote:
 On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote:
  Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
  long enough, the kernel will happily evict it as 'cold' from *its*
  cache, leading to...
 
 This is a whole nother problem.
 
 It is worrisome that we could be benchmarking page replacement in
 Postgres and choose a page replacement algorithm that performs well
 because it tends to evict pages that are in the OS cache. And then one
 day (hopefully not too far off) we'll fix the double buffering problem
 and end up with a strange choice of page replacement algorithm.

That's certainly possible but I don't see the double buffering problem
going away any time particularly soon and, even if it does, it's likely
to either a) mean we're just using the kernel's cache (eg: something w/
mmap, etc), or b) will involve so many other changes that this will end
up getting changed anyway.  In any case, while I think we should
document any such cache management system we employ as having this risk,
I don't think we should worry about it terribly much.

 It also means that every benchmark is super sensitive to how large a
 fraction of system memory Postgres is managing. A benchmark of a
 page replacement algorithm with 3GB shared buffers might perform well
 compared to others on a system with 8GB or 32GB total RAM but actually
 be choosing pages very poorly in normal terms and perform terribly on
 a system with 4GB total RAM.

I'm not following you here - benchmarks are already sensitive to how much
of the system's memory PG is managing (and how much ends up being
*dedicated* to PG's cache and therefore unavailable for other work).

 Ideally what I would like to see is instrumentation of Postgres's
 buffer pinning so we can generate various test loads and then just run
 the different algorithms on them and measure precisely how many page
 evictions it's causing and how often it's choosing pages that
 need to be read in soon after and so on. We shouldn't have to run
 Postgres to get these counts at all, just run the algorithm as we read
 through a text file (or database table) listing the pages being
 accessed.

Go for it.  I'd love to see that also.

Thanks,

Stephen




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Andres Freund
On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote:
 On 04/17/2014 09:38 PM, Stephen Frost wrote:
 * Greg Stark (st...@mit.edu) wrote:
 On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote:
 Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
 long enough, the kernel will happily evict it as 'cold' from *its*
 cache, leading to...
 
 This is a whole nother problem.
 
 It is worrisome that we could be benchmarking page replacement in
 Postgres and choose a page replacement algorithm that performs well
 because it tends to evict pages that are in the OS cache. And then one
 day (hopefully not too far off) we'll fix the double buffering problem
 and end up with a strange choice of page replacement algorithm.
 
 That's certainly possible but I don't see the double buffering problem
 going away any time particularly soon and, even if it does, it's likely
 to either a) mean we're just using the kernel's cache (eg: something w/
 mmap, etc), or b) will involve so many other changes that this will end
 up getting changed anyway.  In any case, while I think we should
 document any such cache management system we employ as having this risk,
 I don't think we should worry about it terribly much.
 
 Note that if we somehow come up with a page replacement algorithm that tends
 to evict pages that are in the OS cache, we have effectively solved the
 double buffering problem. When a page is cached in both caches, evicting it
 from one of them eliminates the double buffering. Granted, you might prefer
 to evict it from the OS cache instead, and such an algorithm could be bad in
 other ways. But if a page replacement algorithm happens to avoid double
 buffering, that's a genuine merit for that algorithm.

I don't think it's a good idea to try to synchronize algorithms with the
OSs. There's so much change about the caching logic in e.g. linux that
it won't stay effective for very long.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
  Note that if we somehow come up with a page replacement algorithm that tends
  to evict pages that are in the OS cache, we have effectively solved the
  double buffering problem. When a page is cached in both caches, evicting it
  from one of them eliminates the double buffering. Granted, you might prefer
  to evict it from the OS cache instead, and such an algorithm could be bad in
  other ways. But if a page replacement algorithm happens to avoid double
  buffering, that's a genuine merit for that algorithm.
 
 I don't think it's a good idea to try to synchronize algorithms with the
 OSs. There's so much change about the caching logic in e.g. linux that
 it won't stay effective for very long.

There's also more than one OS...

Thanks,

Stephen




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Heikki Linnakangas

On 04/17/2014 09:38 PM, Stephen Frost wrote:

* Greg Stark (st...@mit.edu) wrote:

On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote:

Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
long enough, the kernel will happily evict it as 'cold' from *its*
cache, leading to...


This is a whole nother problem.

It is worrisome that we could be benchmarking page replacement in
Postgres and choose a page replacement algorithm that performs well
because it tends to evict pages that are in the OS cache. And then one
day (hopefully not too far off)
we'll fix the double buffering problem and end up with a strange
choice of page replacement algorithm.


That's certainly possible but I don't see the double buffering problem
going away any time particularly soon and, even if it does, it's likely
to either a) mean we're just using the kernel's cache (eg: something w/
mmap, etc), or b) will involve so many other changes that this will end
up getting changed anyway.  In any case, while I think we should
document any such cache management system we employ as having this risk,
I don't think we should worry about it terribly much.


Note that if we somehow come up with a page replacement algorithm that 
tends to evict pages that are in the OS cache, we have effectively 
solved the double buffering problem. When a page is cached in both 
caches, evicting it from one of them eliminates the double buffering. 
Granted, you might prefer to evict it from the OS cache instead, and 
such an algorithm could be bad in other ways. But if a page replacement 
algorithm happens to avoid double buffering, that's a genuine merit for 
that algorithm.


- Heikki




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Merlin Moncure
On Thu, Apr 17, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote:
 On 04/17/2014 09:38 PM, Stephen Frost wrote:
 * Greg Stark (st...@mit.edu) wrote:
 On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote:
 Ehhh.  No.  If it's a hot page that we've been holding in *our* cache
 long enough, the kernel will happily evict it as 'cold' from *its*
 cache, leading to...
 
 This is a whole nother problem.
 
 It is worrisome that we could be benchmarking page replacement in
 Postgres and choose a page replacement algorithm that performs well
 because it tends to evict pages that are in the OS cache. And then one
 day (hopefully not too far off)
 we'll fix the double buffering problem and end up with a strange
 choice of page replacement algorithm.
 
 That's certainly possible but I don't see the double buffering problem
 going away any time particularly soon and, even if it does, it's likely
 to either a) mean we're just using the kernel's cache (eg: something w/
 mmap, etc), or b) will involve so many other changes that this will end
 up getting changed anyway.  In any case, while I think we should
 document any such cache management system we employ as having this risk,
 I don't think we should worry about it terribly much.

 Note that if we somehow come up with a page replacement algorithm that tends
 to evict pages that are in the OS cache, we have effectively solved the
 double buffering problem. When a page is cached in both caches, evicting it
 from one of them eliminates the double buffering. Granted, you might prefer
 to evict it from the OS cache instead, and such an algorithm could be bad in
 other ways. But if a page replacement algorithm happens to avoid double
 buffering, that's a genuine merit for that algorithm.

 I don't think it's a good idea to try to synchronize algorithms with the
 OSs. There's so much change about the caching logic in e.g. linux that
 it won't stay effective for very long.

No. But if you were very judicious, maybe you could hint the o/s
(posix_fadvise) that you don't need its copies of pages that are
likely to stay hot in shared buffers.
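
(Mechanically such a hint is a one-liner. A sketch assuming a raw file descriptor and 8KB blocks; the function name is invented and this is an illustration, not a proposal:)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* Hypothetical: after deciding a block will stay hot in shared buffers,
 * advise the kernel that its own copy of that block is expendable. */
static void
hint_kernel_copy_expendable(int fd, long blocknum)
{
    /* best effort; posix_fadvise() is only advice, so ignore errors */
    (void) posix_fadvise(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                         POSIX_FADV_DONTNEED);
}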

I doubt that's necessary though -- if the postgres caching algorithm
improves such that there is a better tendency for hot pages to stay in
s_b, eventually the O/S will deschedule the page for something else
that needs it.  In other words, otherwise preventable double
buffering is really a measurement of bad eviction policy because it
manifests in volatility of frequently accessed pages.

merlin




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Peter Geoghegan
On Thu, Apr 17, 2014 at 11:53 AM, Merlin Moncure mmonc...@gmail.com wrote:
 No. But if you were very judicious, maybe you could hint the o/s
 (posix_fadvise) that you don't need its copies of pages that are
 likely to stay hot in shared buffers.

Mitsumasa KONDO wrote a patch like that. I don't think the results
were that promising, but things change quickly.


-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
* Merlin Moncure (mmonc...@gmail.com) wrote:
 I doubt that's necessary though -- if the postgres caching algorithm
 improves such that there is a better tendency for hot pages to stay in
 s_b, eventually the O/S will deschedule the page for something else
 that needs it.  In other words, otherwise preventable double
 buffering is really a measurement of bad eviction policy because it
 manifests in volatility of frequently accessed pages.

I wonder if it would help to actually tell the OS to read in buffers
that we're *evicting*...  On the general notion that if the OS already
has them buffered then it's almost a no-op, and if it doesn't and it's
actually a 'hot' buffer that we're gonna need again shortly, the OS will
have it.

In other words, try to make the OS more like a secondary cache to ours
by encouraging it to cache things we're evicting.
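
(The same interface could express this idea too. A sketch with an invented name; note Tom's objection below that if the page really has gone cold this is just wasted read I/O:)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* Hypothetical: as we evict a buffer, ask the kernel to (re)fetch that
 * block so a near-future re-read stays cheap. */
static void
offer_block_to_kernel_on_evict(int fd, long blocknum)
{
    (void) posix_fadvise(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}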

Thanks,

Stephen




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Merlin Moncure
On Thu, Apr 17, 2014 at 2:00 PM, Stephen Frost sfr...@snowman.net wrote:
 * Merlin Moncure (mmonc...@gmail.com) wrote:
 I doubt that's necessary though -- if the postgres caching algorithm
 improves such that there is a better tendency for hot pages to stay in
 s_b, eventually the O/S will deschedule the page for something else
 that needs it.  In other words, otherwise preventable double
 buffering is really a measurement of bad eviction policy because it
 manifests in volatility of frequently accessed pages.

 I wonder if it would help to actually tell the OS to read in buffers
 that we're *evicting*...  On the general notion that if the OS already
 has them buffered then it's almost a no-op, and if it doesn't and it's
 actually a 'hot' buffer that we're gonna need again shortly, the OS will
 have it.

 In other words, try to make the OS more like a secondary cache to ours
 by encouraging it to cache things we're evicting.

I don't think this would work unless we kept some kind of tracking
information on the page itself, which seems not worth a write
operation (maybe if the page is dirtied it could be snuck in there
though...).  IOW, it would only make sense to do this if we knew
that this page was likely to be read in again.  This might be true in
general on particular workloads but is probably a pretty flimsy
assumption without supporting evidence; probably better to let the O/S
deal with it.

merlin




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 I wonder if it would help to actually tell the OS to read in buffers
 that we're *evicting*...  On the general notion that if the OS already
 has them buffered then it's almost a no-op, and if it doesn't and it's
 actually a 'hot' buffer that we're gonna need again shortly, the OS will
 have it.

But if it's actually gone cold, you're just forcing unnecessary read I/O,
not to mention possibly causing something slightly warmer to be lost from
kernel cache.

regards, tom lane




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
* Merlin Moncure (mmonc...@gmail.com) wrote:
  I don't think this would work unless we kept some kind of tracking
 information on the page itself, which seems not worth a write
 operation (maybe if the page is dirtied it could be snuck in there
 though...).  IOW, it would only make sense to do this if we knew
 that this page was likely to be read in again.  This might be true in
 general on particular workloads but is probably a pretty flimsy
 assumption without supporting evidence; probably better to let the O/S
 deal with it.

The trouble is that we're ending up hiding the information from the OS
about the frequency of utilization of that page.  You have a good point
and we wouldn't want to do this for pages that are just accessed once or
similar, but perhaps just mark a page that's reached the 'max' as having
been 'hot' and then, for those pages, advise the OS that while we're
under pressure and need to push this page out, it was once pretty hotly
used and therefore we may want it again soon.

For pages that never reach the 'max' level, we wouldn't do anything on
the assumption that those were only temporarily needed.

Just some thoughts.

Thanks,

Stephen




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Stephen Frost sfr...@snowman.net writes:
  I wonder if it would help to actually tell the OS to read in buffers
  that we're *evicting*...  On the general notion that if the OS already
  has them buffered then it's almost a no-op, and if it doesn't and it's
  actually a 'hot' buffer that we're gonna need again shortly, the OS will
  have it.
 
 But if it's actually gone cold, you're just forcing unnecessary read I/O,
 not to mention possibly causing something slightly warmer to be lost from
 kernel cache.

Certainly possible- see the email I just sent about another thought
around this.

Obviously, none of these thoughts are really fully formed solutions and
are, instead, just speculation and ideas.

Thanks,

Stephen




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Merlin Moncure
On Thu, Apr 17, 2014 at 2:16 PM, Stephen Frost sfr...@snowman.net wrote:
 * Merlin Moncure (mmonc...@gmail.com) wrote:
  I don't think this would work unless we kept some kind of tracking
 information on the page itself, which seems not worth a write
 operation (maybe if the page is dirtied it could be snuck in there
 though...).  IOW, it would only make sense to do this if we knew
 that this page was likely to be read in again.  This might be true in
 general on particular workloads but is probably a pretty flimsy
 assumption without supporting evidence; probably better to let the O/S
 deal with it.

 The trouble is that we're ending up hiding the information from the OS
 about the frequency of utilization of that page.  You have a good point
 and we wouldn't want to do this for pages that are just accessed once or
 similar, but perhaps just mark a page that's reached the 'max' as having
 been 'hot' and then, for those pages, advise the OS that while we're
 under pressure and need to push this page out, it was once pretty hotly
 used and therefore we may want it again soon.

 For pages that never reach the 'max' level, we wouldn't do anything on
 the assumption that those were only temporarily needed.

yeah -- the thing is, we are already too spendy on supplemental
write i/o (hint bits, visible bits, freezing, etc) and it's likely
not worth it to throw something else on the pile unless the page is
already dirty; the medium term trend in storage is that read vs write
performance is becoming increasingly asymmetric, particularly on the
random side, so it's very unlikely to balance out.

merlin




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
On Thursday, April 17, 2014, Merlin Moncure mmonc...@gmail.com wrote:

 yeah -- the thing is, we are already too spendy on supplemental
 write i/o (hint bits, visible bits, freezing, etc) and it's likely
 not worth it to throw something else on the pile unless the page is
 already dirty; the medium term trend in storage is that read vs write
 performance is becoming increasingly asymmetric, particularly on the
 random side, so it's very unlikely to balance out.


Guess I wasn't clear, but I was thinking to read the page in, not do any
writing, and do it in a way that is asynchronous to the process doing the
evicting.

Thanks,

Stephen


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Merlin Moncure
On Thu, Apr 17, 2014 at 2:28 PM, Stephen Frost sfr...@snowman.net wrote:


 On Thursday, April 17, 2014, Merlin Moncure mmonc...@gmail.com wrote:

 yeah -- the thing is, we are already too spendy on supplemental
 write i/o (hint bits, visible bits, freezing, etc) and it's likely
 not worth it to throw something else on the pile unless the page is
 already dirty; the medium term trend in storage is that read vs write
 performance is becoming increasingly asymmetric, particularly on the
 random side, so it's very unlikely to balance out.

 Guess I wasn't clear, but I was thinking to read the page in, not do any
 writing, and do it in a way that is asynchronous to the process doing the
 evicting.

no -- I got you. My point was, that's a pure guess unless you base it
on evidence recorded on the page itself.  Without that evidence
(which requires writing), the operating system is in a better place to
make that guess, so it's probably better to defer that decision.

merlin




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Stephen Frost
On Thursday, April 17, 2014, Merlin Moncure mmonc...@gmail.com wrote:

 no -- I got you. My point was, that's a pure guess unless you base it
 on evidence recorded on the page itself.  Without that evidence
 (which requires writing), the operating system is in a better place to
 make that guess, so it's probably better to defer that decision.


Well, we'd only need that info to be stored in the buffer cache somehow-
wouldn't have to go to disk or cause more I/O, of course. My thinking was
that we could track it with the existing counter too, avoiding even that
small amount of locking to write to the buffer page.

Thanks,

Stephen


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Peter Geoghegan
On Thu, Apr 17, 2014 at 8:10 AM, Greg Stark st...@mit.edu wrote:
 I don't think common sense is compelling. I think you need to pin
 down exactly what it is about btree intermediate pages that the LRU
 isn't capturing and not just argue they're more useful. The LRU is
 already capturing which pages are more heavily used than others so you
 need to identify what it is that makes index pages *even more* useful
 than their frequency and recency of access indicates. Not just that
 they're more useful than an average page.

See example 1.1 within the LRU-K paper.

 So what I think is missing is that indexes are always accessed from
 the root down to the leaf. So the most recent page accessed will
 always be the leaf. And in whatever chain of pages was used to reach
 the last leaf page the least recently accessed will always be the
 root. But we'll need the root page again on the subsequent descent
 even if it's to reach the same leaf page we kept in ram in preference
 to it.

I can't imagine that this is much of a problem in practice. Consider
the break-down of pages within indexes when pgbench scale is 5,000, as
in my original benchmark:

[local] pg@pgbench=# with tots as (
    select count(*) c, type, relname
    from (select relname, relpages, generate_series(1, relpages - 1) i
          from pg_class c join pg_namespace n on c.relnamespace = n.oid
          where relkind = 'i' and nspname = 'public') r,
         lateral (select * from bt_page_stats(relname, i)) u
    group by relname, type)
select tots.relname, relpages - 1 as non_meta_pages, c,
       c / sum(c) over (partition by tots.relname) as prop_of_index, type
from tots join pg_class c on c.relname = tots.relname
order by 2 desc, 1, type;

        relname        | non_meta_pages |    c    |     prop_of_index      | type
-----------------------+----------------+---------+------------------------+------
 pgbench_accounts_pkey |        1370950 |    4828 | 0.00352164557423684307 | i
 pgbench_accounts_pkey |        1370950 | 1366121 | 0.99647762500455888253 | l
 pgbench_accounts_pkey |        1370950 |       1 | 0.00729421204274408257 | r
 pgbench_tellers_pkey  |            274 |     273 | 0.99635036496350364964 | l
 pgbench_tellers_pkey  |            274 |       1 | 0.00364963503649635036 | r
 pgbench_branches_pkey |             28 |      27 | 0.96428571428571428571 | l
 pgbench_branches_pkey |             28 |       1 | 0.03571428571428571429 | r
(7 rows)

Time: 14562.297 ms

Just over 99.6% of pages (leaving aside the meta page) in the big 10
GB pgbench_accounts_pkey index are leaf pages. (That follows from the
fanout: with roughly 283 leaf pages per inner page, the non-leaf
fraction is only about 1/283.) The inner pages and root page are at an
enormous advantage.
indexes don't even have what would be separately classified as an
inner page (and not a root page) at all, because it's perfectly
sufficient to only have a root page to get to any one of, say, 273
leaf pages (in the case of pgbench_tellers_pkey here).

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Andres Freund
On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote:
 Just over 99.6% of pages (leaving aside the meta page) in the big 10
 GB pgbench_accounts_pkey index are leaf pages.

That's a rather nice number. I knew it was big, but I'd have guessed
it'd be a percent lower.

Do you happen to have the same stat handy for a sensibly wide text or
numeric real world index? It'd be interesting to see what the worst case
there is.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Peter Geoghegan
On Thu, Apr 17, 2014 at 1:39 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote:
 Just over 99.6% of pages (leaving aside the meta page) in the big 10
 GB pgbench_accounts_pkey index are leaf pages.

 That's a rather nice number. I knew it was big, but I'd have guessed
 it'd be a percent lower.

Yes, it's usually past 99.5% for int4. It's really bad if it's as low
as 96%, and I think that often points to what are arguably bad
indexing choices, like indexing text columns that have long text
strings.

 Do you happen to have the same stat handy for a sensibly wide text or
 numeric real world index? It'd be interesting to see what the worst case
 there is.

Yes, as it happens I do:
http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=a...@mail.gmail.com

I was working off my Mouse Genome database, which is actually
real-world data used by medical researchers, stored in a PostgreSQL
database by those researchers and made available for the benefit of
other medical researchers.

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Peter Geoghegan
On Thu, Apr 17, 2014 at 1:33 PM, Peter Geoghegan p...@heroku.com wrote:
 I can't imagine that this is much of a problem in practice.

Although I will add that not caching highly useful inner pages for the
medium term, because that index isn't being used at all for 5 minutes
probably is very bad. Using the 4,828 buffers that it would take to
store all the inner pages (as in my large primary index example) to go
store something else is probably penny wise and pound foolish.

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Greg Stark
On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan p...@heroku.com wrote:
 Although I will add that not caching highly useful inner pages for the
 medium term, because that index isn't being used at all for 5 minutes
 probably is very bad. Using the 4,828 buffers that it would take to
 store all the inner pages (as in my large primary index example) to go
 store something else is probably penny wise and pound foolish.

But there could easily be 20 unused indexes for every 1 index that is
being used.




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-17 Thread Peter Geoghegan
On Thu, Apr 17, 2014 at 6:50 PM, Greg Stark st...@mit.edu wrote:
 On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan p...@heroku.com wrote:
 Although I will add that not caching highly useful inner pages for the
 medium term, because that index isn't being used at all for 5 minutes
 probably is very bad. Using the 4,828 buffers that it would take to
 store all the inner pages (as in my large primary index example) to go
 store something else is probably penny wise and pound foolish.

 But there could easily be 20 unused indexes for every 1 index that is
 being used.

Sure, but then there might not be. Obviously there is a trade-off to
be made between recency and frequency. One interesting observation in
the LRU-K paper is that for their test case, a pure LFU actually works
very well, despite, as the authors acknowledge, being a terrible
algorithm in the real world. That's because their test case is so
simple, and concerns only one table/index, with a uniform
distribution.

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Peter Geoghegan
On Tue, Apr 15, 2014 at 9:44 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan p...@heroku.com wrote:
 In the past, various hackers have noted problems they've observed with
 this scheme. A common pathology is to see frantic searching for a
 victim buffer only to find all buffer usage_count values at 5. It may
 take multiple revolutions of the clock hand before a victim buffer is
 found, as usage_count is decremented for each and every buffer.  Also,
 BufFreelistLock contention is considered a serious bottleneck [1],
 which is related.

 I think that the basic problem here is that usage counts increase when
 buffers are referenced, but they decrease when buffers are evicted,
 and those two things are not in any necessary way connected to each
 other.  In particular, if no eviction is happening, reference counts
 will converge to the maximum value.  I've read a few papers about
 algorithms that attempt to segregate the list of buffers into hot
 and cold lists, and an important property of such algorithms is that
 they mustn't be allowed to make everything hot.

It's possible that I've misunderstood what you mean here, but do you
really think it's likely that everything will be hot, in the event of
using something like what I've sketched here? I think it's an
important measure against this general problem that buffers really
earn the right to be considered hot, so to speak. With my prototype,
in order for a buffer to become as hard to evict as possible, at a
minimum it must be *continually* pinned for at least 30 seconds.
That's actually a pretty tall order. Although, as I said, I wouldn't
be surprised if it was worth making it possible for buffers to be even
more difficult to evict than that. It should be extremely difficult to
evict a root B-Tree page, and to a lesser extent inner pages even
under a lot of cache pressure, for example. There are lots of
workloads in which that can happen, and I have a hard time believing
that it's worth it to evict given the extraordinary difference in
their utility as compared to a lot of other things. I can imagine a
huge barrier against evicting what is actually a relatively tiny
number of pages being worth it.
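
(The throttling rule being discussed is easy to state in code. A standalone sketch with invented field names, using the max of 10 and the 3 second interval mentioned in this thread -- and still paying the per-pin clock read that Andres objects to elsewhere in the thread:)

#include <time.h>

#define BM_MAX_USAGE_COUNT   10
#define USAGE_BUMP_INTERVAL  3      /* seconds between bumps, per buffer */

typedef struct SketchBufferDesc
{
    int    usage_count;
    time_t last_bump;               /* when usage_count last increased */
} SketchBufferDesc;

static void
pin_buffer(SketchBufferDesc *buf)
{
    time_t now = time(NULL);

    /* increments are throttled; decrements (not shown) are not */
    if (buf->usage_count < BM_MAX_USAGE_COUNT &&
        now - buf->last_bump >= USAGE_BUMP_INTERVAL)
    {
        buf->usage_count++;
        buf->last_bump = now;
    }
}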

I don't want to dismiss what you're saying about heating and cooling
being unrelated, but I don't find the conclusion that not everything
can be hot obvious. Maybe heat should be relative rather than
absolute, and maybe that's actually what you meant. There is surely
some workload where buffer access actually is perfectly uniform, and
what do you do there? What temperature are those buffers?

It occurs to me that within the prototype patch, even though
usage_count is incremented in a vastly slower fashion (in a wall time
sense), clock sweep doesn't take advantage of that. I should probably
investigate having clock sweep become more aggressive in decrementing
in response to realizing that it won't get some buffer's usage_count
down to zero on the next revolution either. There are certainly
problems with that, but they might be fixable. Within the patch, in
order for it to be possible for the usage_count to be incremented in
the interim, an average of 1.5 seconds must pass, so if clock sweep
were to anticipate another no-set-to-zero revolution, it seems pretty
likely that it would be exactly right, or if not then close enough,
since it can only really fail to correct for some buffers getting
incremented once more in the interim. Conceptually, it would be like
multiple logical revolutions were merged into one actual one,
sufficient to have the next revolution find a victim buffer.
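
(One way to write the merged revolutions down, as a standalone sketch with invented names: remember the smallest usage_count seen during a failed revolution and subtract that amount, rather than 1, so the following revolution must find a victim:)

static int
clock_sweep_merged(int *usage_count, int nbuffers, int *hand)
{
    for (;;)
    {
        int min_seen = usage_count[*hand];

        for (int n = 0; n < nbuffers; n++)
        {
            int i = (*hand + n) % nbuffers;

            if (usage_count[i] == 0)
            {
                *hand = (i + 1) % nbuffers;
                return i;           /* found a victim */
            }
            if (usage_count[i] < min_seen)
                min_seen = usage_count[i];
        }

        /*
         * No victim in a whole revolution, so every buffer has
         * usage_count >= min_seen >= 1.  Apply min_seen logical
         * revolutions at once; at least one buffer reaches zero.
         */
        for (int i = 0; i < nbuffers; i++)
            usage_count[i] -= min_seen;
    }
}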

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Andres Freund
Hi,

It's good to see focus on this - some improvements around s_b are sorely
needed.

On 2014-04-14 10:11:53 -0700, Peter Geoghegan wrote:
 1) Throttles incrementation of usage_count temporally. It becomes
 impossible to increment usage_count for any given buffer more
 frequently than every 3 seconds, while decrementing usage_count is
 totally unaffected.

I think this is unfortunately completely out of the question. For one, a
gettimeofday() for every buffer pin will become a significant performance
problem. Even the computation of the xact/stmt start/stop timestamps
shows up pretty heavily in profiles today - and they are far less
frequent than buffer pins. And that's on x86 linux, where gettimeofday()
is implemented as something more lightweight than a full syscall.

The other significant problem I see with this is that it's not adaptive
to the actual throughput of buffers in s_b. In many cases there's
hundreds of clock cycles through shared buffers in 3 seconds. By only
increasing the usagecount that often you've destroyed the little
semblance of a working LRU there is right now.

It also wouldn't work well for situations with a fast changing
workload > s_b. If you have frequent queries that take a second or so
and access some data repeatedly (index nodes or whatnot) only increasing
the usagecount once will mean they'll continually fall back to disk access.

 2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not
 5 as before. ... . This step on its own would be assumed extremely
 counter-productive by those in the know, but I believe that other
 measures ameliorate the downsides. I could be wrong about how true
 that is in other cases, but then the case helped here isn't what you'd
 call a narrow benchmark.

I don't see which of the mechanisms you have suggested counters this?

I think having more granular usagecount is a good idea, but I don't
think it can realistically be implemented with the current method of
choosing victim buffers. The amount of cacheline misses around that is
already a major scalability limit; we surely can't make this even
worse. I think it'd be possible to get back to this if we had a better
bgwriter implementation.

Greetings,

Andres Freund




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Peter Geoghegan
On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund and...@2ndquadrant.com wrote:
 I think this is unfortunately completely out of the question. For one, a
 gettimeofday() for every buffer pin will become a significant performance
 problem. Even the computation of the xact/stmt start/stop timestamps
 shows up pretty heavily in profiles today - and they are far less
 frequent than buffer pins. And that's on x86 linux, where gettimeofday()
 is implemented as something more lightweight than a full syscall.

Come on, Andres. Of course exactly what I've done here is completely
out of the question as a patch that we can go and commit right now.
I've numerous caveats about bloating the buffer descriptors, and about
it being a proof of concept. I'm pretty sure we can come up with a
scheme to significantly cut down on the number of gettimeofday() calls
if it comes down to it. In any case, I'm interested in advancing our
understanding of the problem right now. Let's leave the minutiae to
one side for the time being.

 The other significant problem I see with this is that it's not adaptive
 to the actual throughput of buffers in s_b. In many cases there's
 hundreds of clock cycles through shared buffers in 3 seconds. By only
 increasing the usagecount that often you've destroyed the little
 semblance of a working LRU there is right now.

If a usage_count can get to BM_MAX_USAGE_COUNT from its initial
allocation within an instant, that's bad. It's that simple. Consider
all the ways in which that can happen almost by accident.

You could probably reasonably argue that the trade-off or lack of
adaption (between an LRU and an LFU) that this particular sketch of
mine represents is inappropriate or sub-optimal, but I don't
understand why you're criticizing the patch for doing what I expressly
set out to do. As I wrote, I think a very real problem may be that
approximating an LRU is bad because an actual LRU is bad.

 It also wouldn't work well for situations with a fast changing
  workload > s_b. If you have frequent queries that take a second or so
 and access some data repeatedly (index nodes or whatnot) only increasing
 the usagecount once will mean they'll continually fall back to disk access.

No, it shouldn't, because there is a notion of buffers getting a fair
chance to prove themselves. Now, it might well be the case that there
are workloads where what I've done to make that happen in this
prototype doesn't work out too well - I've already said so. But should
a buffer get a usage count of 5 just because the user inserted 5
tuples within a single DML command, for example? If so, why?

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Andres Freund
On 2014-04-16 01:58:23 -0700, Peter Geoghegan wrote:
 On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  I think this is unfortunately completely out of the question. For one, a
  gettimeofday() for every buffer pin will become a significant performance
  problem. Even the computation of the xact/stmt start/stop timestamps
  shows up pretty heavily in profiles today - and they are far less
  frequent than buffer pins. And that's on x86 linux, where gettimeofday()
  is implemented as something more lightweight than a full syscall.
 
 Come on, Andres. Of course exactly what I've done here is completely
 out of the question as a patch that we can go and commit right now.
 I've numerous caveats about bloating the buffer descriptors, and about
 it being a proof of concept. I'm pretty sure we can come up with a
 scheme to significantly cut down on the number of gettimeofday() calls
 if it comes down to it. In any case, I'm interested in advancing our
 understanding of the problem right now. Let's leave the minutiae to
 one side for the time being.

*I* don't think any scheme that involves measuring the time around
buffer pins is going to be acceptable. It's better than I say that now
rather than when you've invested significant time into the approach, no?

  The other significant problem I see with this is that it's not adaptive
  to the actual throughput of buffers in s_b. In many cases there's
  hundreds of clock cycles through shared buffers in 3 seconds. By only
  increasing the usagecount that often you've destroyed the little
  semblance of a working LRU there is right now.
 
 If a usage_count can get to BM_MAX_USAGE_COUNT from its initial
 allocation within an instant, that's bad. It's that simple. Consider
 all the ways in which that can happen almost by accident.

Yes, I agree that that's a problem. Having it immediately go down to zero
is a problem as well though. And that's what will happen in many scenarios,
because you have time limits on increasing the usagecount, but not on
decreasing it.

  It also wouldn't work well for situations with a fast changing
  workload > s_b. If you have frequent queries that take a second or so
  and access some data repeatedly (index nodes or whatnot) only increasing
  the usagecount once will mean they'll continually fall back to disk access.
 
 No, it shouldn't, because there is a notion of buffers getting a fair
 chance to prove themselves.

If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock
cycles/second, how does *any* buffer have a chance to prove itself?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Peter Geoghegan
On Wed, Apr 16, 2014 at 2:18 AM, Andres Freund and...@2ndquadrant.com wrote:
 *I* don't think any scheme that involves measuring the time around
 buffer pins is going to be acceptable. It's better that I say that now
 rather than when you've invested significant time into the approach, no?

Well, I do think that it will be possible to make something like that
work. LRU-K/LRU-2 involves remembering the last two access times (not
the last one). Researchers considered preeminent authorities on
caching algorithms thought that was a good idea in 1993. There are
plenty of other examples of similar schemes too.

 No, it shouldn't, because there is a notion of buffers getting a fair
 chance to prove themselves.

 If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock
 cycles/second, how does *any* buffer have a chance to prove itself?

There could be lots of ways. I thought about representing that more
directly. I don't think that it's useful to have a large number of
revolutions in search of a victim under any circumstances.
Fundamentally, you're asking "what if any scheme here leans too
heavily towards frequency?". That could certainly be a problem, as
I've said, and we could think about adaptation over heuristics, as
I've said, but it is very obviously a big problem that clock sweep
doesn't really care about frequency one bit right now.

Why should I be the one with all the answers? Aren't you interested in
the significance of the patch, and the test case?

-- 
Peter Geoghegan




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Andres Freund
Hi,

On 2014-04-16 02:57:54 -0700, Peter Geoghegan wrote:
 Why should I be the one with all the answers?

Who said you need to be? The only thing I am saying is that I don't
agree with some of your suggestions.

I only responded to the thread now because downthread (in
CAM3SWZQa2OAVUrfPL-df=we1smozkbr392sw_novukzepxh...@mail.gmail.com) you
further argued using the timestamp - which I think is a flawed
concept. So I thought it'd be fair to argue against it now, rather than
later.

 Aren't you interested in the significance of the patch, and the test case?

Not particularly in the specifics to be honest. The tradeoffs of the
techniques you used in there seem prohibitive to me. It's easy to make
individual cases faster by sacrificing others.
Sometimes it's useful to prototype solutions while narrowing the scope
for evaluation to get faster feedback, but as I don't see the solutions
to be applicable in the general case...

I think it's very important to improve upon the current state. It's imo
one of postgres' biggest issues. But it's also far from trivial,
otherwise it'd be done already.

I *personally* don't think it's very likely that we can improve
significantly upon the current state as long as every process regularly
participates in the clock sweep. ISTM that prevents many more elaborate
techniques from being used (cache misses/bus traffic, locking). But that's
just gut feeling.
I also think there are bigger issues than the actual LRU/whatever
behaviour, namely the scalability issues around shared buffers making
both small and big s_b settings major bottlenecks. But that's just where
I have seen more issues personally.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Robert Haas
On Wed, Apr 16, 2014 at 3:22 AM, Peter Geoghegan p...@heroku.com wrote:
 It's possible that I've misunderstood what you mean here, but do you
 really think it's likely that everything will be hot, in the event of
 using something like what I've sketched here? I think it's an
 important measure against this general problem that buffers really
 earn the right to be considered hot, so to speak. With my prototype,
 in order for a buffer to become as hard to evict as possible, at a
 minimum it must be *continually* pinned for at least 30 seconds.
 That's actually a pretty tall order. Although, as I said, I wouldn't
 be surprised if it was worth making it possible for buffers to be even
 more difficult to evict than that. It should be extremely difficult to
 evict a root B-Tree page, and to a lesser extent inner pages even
 under a lot of cache pressure, for example. There are lots of
 workloads in which that can happen, and I have a hard time believing
 that it's worth it to evict given the extraordinary difference in
 their utility as compared to a lot of other things. I can imagine a
 huge barrier against evicting what is actually a relatively tiny
 number of pages being worth it.

I'm making a general statement about a property that I think a buffer
eviction algorithm ought to have.  I actually didn't say anything
about the algorithm you've chosen one way or the other.  Obviously,
you've built in some protections against everything becoming hot, and
that's a good thing as far as it goes.  But you also have a greatly
increased risk of everything becoming cold.  All you need is a rate of
buffer eviction that circles shared_buffers more often than once every
3 seconds, and everything will gradually cool down until you once
again can't distinguish which stuff is hot from which stuff isn't.

 I don't want to dismiss what you're saying about heating and cooling
 being unrelated, but I don't find the conclusion that not everything
 can be hot obvious. Maybe heat should be relative rather than
 absolute, and maybe that's actually what you meant. There is surely
 some workload where buffer access actually is perfectly uniform, and
 what do you do there? What temperature are those buffers?

Obviously, some value lower than the maximum and higher than the
minimum.  If they're all at max temperature and then a new buffer (a
btree root or vm page, for example) comes along and is much hotter,
there's no room on the scale left to express that.  If they're all at
min temperature and then a new buffer comes along that is just used
once and thrown out, there's no room left on the scale for that buffer
to emerge as a good candidate for eviction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Ants Aasma
On Wed, Apr 16, 2014 at 7:44 AM, Robert Haas robertmh...@gmail.com wrote:
 I think that the basic problem here is that usage counts increase when
 buffers are referenced, but they decrease when buffers are evicted,
 and those two things are not in any necessary way connected to each
 other.  In particular, if no eviction is happening, reference counts
 will converge to the maximum value.  I've read a few papers about
 algorithms that attempt to segregate the list of buffers into hot
 and cold lists, and an important property of such algorithms is that
 they mustn't be allowed to make everything hot.  It's easy to be too
 simplistic, here: an algorithm that requires that no more than half
 the list be hot will fall over badly on a workload where the working
 set exceeds the available cache and the really hot portion of the
 working set is 60% of the available cache.  So you need a more
 sophisticated algorithm than that.  But that core property that not
 all buffers can be hot must somehow be preserved, and our algorithm
 doesn't.

FWIW in CLOCK-Pro segregating buffers between hot and cold is tied to
eviction and the clock sweep, the ratio between hot and cold is
dynamically adapted based on prior experience. The main downside is
that it seems to require an indirection somewhere either in the clock
sweep or buffer lookup. Maybe it's possible to avoid that with some
clever engineering if we think hard enough.

CLOCK-Pro may also have too little memory of hotness, making it too
easy to blow the whole cache away with a burst of activity. It may be
useful to have a (possibly tunable) notion of fairness where one
query/backend can't take over the cache even though it may be an
overall win in terms of total number of I/Os performed. Maybe we need
to invent Generalized CLOCK-Pro with a larger number of levels,
ranging from cold, hot and scalding to infernal. :)

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Merlin Moncure
On Tue, Apr 15, 2014 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Wed, Apr 16, 2014 at 5:00 AM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma a...@cybertec.at wrote:
 There's a paper on a non-blocking GCLOCK algorithm that does a
 lock-free clock sweep and buffer pinning[7]. If we decide to stay with
 GCLOCK it may be interesting, although I still believe that some
 variant of buffer nailing[8] is a better idea; my experience shows
 that most of the locking overhead is cache line bouncing, ignoring the
 extreme cases where our naive spinlock implementation blows up.

 You might be right about that, but let's handle one problem at a time.
 Who knows what the bottleneck will end up being if and when we address
 the naivety around frequency? I want to better characterize that
 problem first.

 Just to summarize the previous discussion: the improvements that we
 decided to make in this area, based on the feedback, are as follows:

 1. Bgwriter needs to be improved so that it can help in reducing
 usage count and finding next victim buffer (run the clock sweep
 and add buffers to the free list).
 2. SetLatch for bgwriter (wake up bgwriter) when the freelist is
 running low.
 3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer
 (a spinlock for the freelist, and an lwlock for the clock sweep).
 Here we can try to make it lock-free based on atomic ops as
 well.
 4. Bgwriter needs to be more aggressive; the logic by which it
 calculates how many buffers it needs to process should be
 improved.
 5. Contention around buffer mapping locks.
 6. Cacheline bouncing around the buffer header spinlocks, is there
 anything we can do to reduce this?
 7. Choose Optimistically used buffer in StrategyGetBuffer().
 8. Don't bump the usage count every time buffer is pinned.

What about:  9. Don't wait on locked buffer in the clock sweep.

merlin




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Andres Freund
On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote:
  1. Bgwriter needs to be improved so that it can help in reducing
  usage count and finding next victim buffer (run the clock sweep
  and add buffers to the free list).
  2. SetLatch for bgwriter (wake up bgwriter) when the freelist is
  running low.
  3. Split the workdone globallock (Buffreelist) in StrategyGetBuffer
  (a spinlock for the freelist, and an lwlock for the clock sweep).
  Here we can try to make it lock-free based on atomic ops as
  well.
  4. Bgwriter needs to be more aggressive; the logic by which it
  calculates how many buffers it needs to process should be
  improved.
  5. Contention around buffer mapping locks.
  6. Cacheline bouncing around the buffer header spinlocks, is there
  anything we can do to reduce this?
  7. Choose Optimistically used buffer in StrategyGetBuffer().
  8. Don't bump the usage count every time buffer is pinned.
 
 What about:  9. Don't wait on locked buffer in the clock sweep.

I don't think we do that? Or are you referring to locked buffer headers?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Merlin Moncure
On Tue, Apr 15, 2014 at 11:44 PM, Robert Haas robertmh...@gmail.com wrote:
 I think that the basic problem here is that usage counts increase when
 buffers are referenced, but they decrease when buffers are evicted,
 and those two things are not in any necessary way connected to each
 other.  In particular, if no eviction is happening, reference counts
 will converge to the maximum value.  I've read a few papers about
 algorithms that attempt to segregate the list of buffers into hot
 and cold lists, and an important property of such algorithms is that
 they mustn't be allowed to make everything hot.  It's easy to be too
 simplistic, here: an algorithm that requires that no more than half
 the list be hot will fall over badly on a workload where the working
 set exceeds the available cache and the really hot portion of the
 working set is 60% of the available cache.  So you need a more
 sophisticated algorithm than that.  But that core property that not
 all buffers can be hot must somehow be preserved, and our algorithm
 doesn't.

A while back you sketched out an idea that did something like that:
hotly accessed buffers became 'perma-pinned' such that they no longer
participated in the clock sweep for eviction and there was a side-line
process that did a two-stage eviction (IIRC) from the super-hot stack
in order to mitigate locking.  This idea had a couple of nice
properties:

1) very hot buffers no longer get refcounted, reducing spinlock
contention (which has been documented in real-world workloads)
2) the eviction loop shrinks: although you still have to check the 'very
hot' flag, that's an unlocked check (again, IIRC) and no further
processing is done (see the sketch below).
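
Here is roughly what that check could look like in the sweep loop
(hypothetical names; the flag read is deliberately unlocked, so a
stale answer costs at most one wasted visit):

    #include <stdbool.h>
    #include <stdint.h>

    #define NBUFFERS 1024

    typedef struct
    {
        bool     nailed;            /* set once a buffer goes perma-pinned */
        uint16_t usage_count;
        uint16_t refcount;
    } BufferDescSketch;

    static BufferDescSketch pool[NBUFFERS];
    static int hand;

    static int
    sweep_skipping_nailed(void)
    {
        for (;;)
        {
            int idx = hand;
            BufferDescSketch *buf = &pool[idx];

            hand = (hand + 1) % NBUFFERS;

            if (buf->nailed)        /* unlocked read: no spinlock, no
                                     * exclusive cache line request */
                continue;

            if (buf->refcount == 0)
            {
                if (buf->usage_count == 0)
                    return idx;
                buf->usage_count--;
            }
        }
    }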

The downside of this approach was complexity and the difficulty of
testing for edge cases.  I would like to point out, though, that while i/o
efficiency gains are nice, I think contention issues are the bigger
fish to fry.

On Mon, Apr 14, 2014 at 12:11 PM, Peter Geoghegan p...@heroku.com wrote:
 1) Throttles incrementation of usage_count temporally. It becomes
 impossible to increment usage_count for any given buffer more
 frequently than every 3 seconds, while decrementing usage_count is
 totally unaffected.

Hm, that's expensive.  How about a heuristic based on the number of
buffer allocations and the size of the buffer pool?
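
Something along those lines, perhaps (a hypothetical, untested
sketch): stamp each buffer with a global allocation counter rather
than a wall-clock time, so "recently" is measured in sweep progress
instead of seconds and no gettimeofday() calls are needed.

    #include <stdint.h>

    #define NBUFFERS        1024
    #define MAX_USAGE_COUNT 5

    static uint64_t num_buffer_allocs;  /* bumped once per buffer allocation */

    typedef struct
    {
        uint64_t last_bump_alloc;       /* allocation count at last bump */
        uint16_t usage_count;
    } BufSketch;

    static void
    maybe_bump(BufSketch *buf)
    {
        /* A quarter lap of the pool plays the role of Peter's 3-second
         * correlated-reference window. */
        if (num_buffer_allocs - buf->last_bump_alloc < NBUFFERS / 4)
            return;                     /* correlated reference; ignore */

        if (buf->usage_count < MAX_USAGE_COUNT)
            buf->usage_count++;
        buf->last_bump_alloc = num_buffer_allocs;
    }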

On Wed, Apr 16, 2014 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote:
 What about:  9. Don't wait on locked buffer in the clock sweep.

 I don't think we do that? Or are you referring to locked buffer headers?

Right -- exactly.  I posted a patch for this a while back. It's quite
trivial: implement a trylock variant of the buffer header lock macro
and further guard the check with a non-locking test (which TAS()
already does generally, but the idea is to avoid taking the cache line
lock in likely cases of contention).  I believe this to be unambiguously
better: even if the contention is self-healing or unlikely, there is no
good reason to jump into a spinlock fray or even request a contended
cache line while holding a critical lock.
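
The shape of it, sketched with C11 atomics rather than the actual
s_lock.h macros (hypothetical names): a plain read never requests the
cache line in exclusive mode, so visibly contended headers are skipped
without bouncing the line, and there is a single TAS attempt with no
spin.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct
    {
        atomic_bool locked;
        /* ... rest of the buffer header ... */
    } BufHdrSketch;

    static bool
    try_lock_buf_hdr(BufHdrSketch *buf)
    {
        /* Non-locking pre-check: give up on visibly locked headers. */
        if (atomic_load_explicit(&buf->locked, memory_order_relaxed))
            return false;

        /* One test-and-set attempt; the caller simply moves on if it fails. */
        return !atomic_exchange_explicit(&buf->locked, true,
                                         memory_order_acquire);
    }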

merlin




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Andres Freund
On 2014-04-16 08:25:23 -0500, Merlin Moncure wrote:
 The downside of this approach was complexity and the difficulty of
 testing for edge cases.  I would like to point out, though, that while i/o
 efficiency gains are nice, I think contention issues are the bigger
 fish to fry.

That's my feeling as well.

 
 On Wed, Apr 16, 2014 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote:
  What about:  9. Don't wait on locked buffer in the clock sweep.
 
  I don't think we do that? Or are you referring to locked buffer headers?
 
 Right -- exactly.  I posted a patch for this a while back. It's quite
 trivial: implement a trylock variant of the buffer header lock macro
 and further guard the check with a non-locking test (which TAS()
 already does generally, but the idea is to avoid taking the cache line
 lock in likely cases of contention).  I believe this to be unambiguously
 better: even if the contention is self-healing or unlikely, there is no
 good reason to jump into a spinlock fray or even request a contended
 cache line while holding a critical lock.

IIRC you had problems proving the benefits of that, right?

I think that's because the locking times of buffer headers are short
enough that it's really unlikely to read a locked buffer header
spinlock. The spinlock acquisition will have made the locker the
exclusive owner of the spinlock's cache line in the majority of cases,
and as soon as
that happens the cache miss/transfer will take far longer than the lock
takes.

I think this is the wrong level to optimize things. Imo there are two
possible solutions (that don't exclude each other):

* perform the clock sweep in one process so there's a very fast way to
  get to a free buffer. Possibly in a partitioned way.

* Don't take a global exclusive lock while performing the clock
  sweep. Instead increase StrategyControl->nextVictimBuffer in chunks
  under an exclusive lock, and then scan the potential victim buffers in
  those chunks without a global lock held (sketched below).
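
A minimal sketch of the second idea (hypothetical names; an atomic
fetch-add stands in for the brief exclusive-lock increment): claim a
chunk of clock positions in one shot, then inspect those buffers with
no global lock held.

    #include <stdatomic.h>

    #define NBUFFERS    1024
    #define SWEEP_CHUNK 16

    static atomic_uint next_victim;     /* stand-in for
                                         * StrategyControl->nextVictimBuffer */

    /* Claim SWEEP_CHUNK consecutive clock positions; returns the first. */
    static unsigned
    claim_chunk(void)
    {
        return atomic_fetch_add(&next_victim, SWEEP_CHUNK) % NBUFFERS;
    }

    /* The caller then scans pool[start] .. pool[(start + SWEEP_CHUNK - 1)
     * % NBUFFERS] for a victim without any global lock held, claiming
     * another chunk if nothing suitable turns up. */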

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



