Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 20, 2015 at 2:53 PM, Jim Nasby jim.na...@bluetreble.com wrote:
> I think that would help, but it still leaves user backends trying to
> advance the clock, which is quite painful. Has anyone tested running the
> clock in the background? We need a wiki page with all the ideas that have
> been tested around buffer management...

Amit's bgreclaimer patch did that, but we weren't able to demonstrate a clear benefit. I haven't given up on the idea yet, but we've got to be able to prove that it's a good idea in practice as well as in theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 7:02 PM, Robert Haas robertmh...@gmail.com wrote:
> On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan p...@heroku.com wrote:
>> Why is that good? We did discuss this before. I've recapped some of what
>> I believe to be the most salient points below. I think that people were
>> all too quick to dismiss the idea of a wall time interval playing some
>> role here (at least as a defense against correlated references, as a
>> correlated reference period). I suppose that that's because it doesn't
>> fit with an intuition that says that that kind of interval ought to be
>> derived algebraically - magic delay settings are considered suspect.
>
> Yep, Tom gave that reason here:
> http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us
>
> But there was also this point from Andres - gettimeofday is not free:
> http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de
>
> And this point from me - this can degrade to random eviction under high
> pressure:
> http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com
>
> You'll notice that my proposal avoids all three of those objections.

All I'm saying is that you shouldn't dismiss the idea without trying it out properly. The LRU-K paper, from the early 1990s, recommends this, and gettimeofday() calls were a lot more expensive back then. I'm sure that there is a way to overcome these issues if it turns out to be worth it (by amortizing gettimeofday() calls by driving it from an auxiliary process like the bgwriter, for example). In fact, I'm almost certain there is, because at least one other major database system uses just such a reference period.

Back when ARC (or was it 2Q?) was committed before being reverted shortly thereafter, there was a similar idea actually implemented, but with XIDs preventing correlated references (which the LRU-K paper also hints at). I think that an actual delay is more robust than that, though.
Even though this correlated reference delay is all I implemented with my prototype, it's just one piece of the puzzle.

-- 
Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 20, 2015 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote:
> On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout klep...@svana.org wrote:
>> I've been following this thread from the side with interest and got
>> twigged by the point about loss of information. If you'd like better
>> information about relative ages, you can achieve this by raising the cap
>> on the usage count and dividing (or right-shifting) each sweep.
>
> Yeah, I thought about that, too. It might be worth experimenting with.
>
>> This would allow you to remember much more about the relative worth of
>> often used pages. With a cap of 32 you'd have the same effect as now,
>> where after 5 sweeps the buffer is evicted. Mathematically the count
>> would converge to the number of times the block is used per sweep.
>
> Hmm, interesting point. It's possible that we'd still have problems with
> everything maxing out at 32 on some workloads, but at least it'd be a
> little harder to max out at 32 than at 5.

Do we have any reproducible test cases to evaluate these assumptions? I haven't looked at this stuff for a while, but my main issue with the clock sweep was finding sweep-heavy cases that did not also have trouble with the buffer mapping locks (although the facts on the ground may have changed since then).

merlin
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout klep...@svana.org wrote:
> I've been following this thread from the side with interest and got
> twigged by the point about loss of information. If you'd like better
> information about relative ages, you can achieve this by raising the cap
> on the usage count and dividing (or right-shifting) each sweep.

Yeah, I thought about that, too. It might be worth experimenting with.

> This would allow you to remember much more about the relative worth of
> often used pages. With a cap of 32 you'd have the same effect as now,
> where after 5 sweeps the buffer is evicted. Mathematically the count
> would converge to the number of times the block is used per sweep.

Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote:
> This is my point though (you're right that flushed isn't always the same
> as eviction but that's not the important point here). Right now we only
> demote when we consider buffers for eviction. But we promote when we pin
> buffers. Those two things aren't necessarily happening at the same rate
> and in fact are often orders of magnitude different.

I am absolutely, positively, violently in 100% agreement with this. I have made the same point before, but it sure is nice to hear someone else thinking about it the same way.

> So it makes sense that we end up with a lot of buffers promoted all the
> way to the most recently used ntile and then have to do n passes around
> the clock and have no good information about which buffer to evict.

Right.

> What I'm saying is that we should demote a buffer every time we promote a
> buffer. So every time we pin a buffer we should advance the clock a
> corresponding amount. (I know I'm being intentionally vague about what
> the corresponding amount is.) The important thing is that the two should
> be tied together.

Yes, absolutely. If you tilt your head the right way, my proposal of limiting the number of promotions per clock sweep has the effect of tying buffer demotion and buffer promotion together much more tightly than is the case right now. You are limited to 2 promotions per demotion; and practically speaking not all buffers eligible to be promoted will actually get accessed, so the number of promotions per demotion will in reality be somewhere between 0 and 2. Ideally it would be exactly 1, but 1 +/- 1 is still a tighter limit than we have at present. Which is not to say there isn't some other idea that is better still.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 20, 2015 at 11:00 AM, Merlin Moncure mmonc...@gmail.com wrote:
>> Hmm, interesting point. It's possible that we'd still have problems with
>> everything maxing out at 32 on some workloads, but at least it'd be a
>> little harder to max out at 32 than at 5.
>
> Do we have any reproducible test cases to evaluate these assumptions?

The particular point that you are responding to here is easy to reproduce. Just create any workload that fits in shared_buffers - a small pgbench database, for example - and access stuff. No usage counts will go down, but every access will drive usage counts up. Eventually everything will hit any maximum you care to install, and actually I don't think it takes very long. You can use pg_buffercache to see the results.

> I haven't looked at this stuff for a while, but my main issue with the
> clock sweep was finding sweep-heavy cases that did not also have trouble
> with the buffer mapping locks (although the facts on the ground may have
> changed since then).

We increased the number of buffer mapping locks to 128, so that problem should be considerably ameliorated now. But it's not totally gone, as demonstrated by Andres's experiments with my chash patch. There was also a patch that eliminated BufFreelistLock in favor of a spinlock held for much shorter time periods, and then Andres took that one step further and made it use atomics. That used to be a *terrible* bottleneck on eviction-heavy workloads and is now much improved. More work may remain to be done, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/20/15 11:11 AM, Robert Haas wrote:
> On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote:
>> This is my point though (you're right that flushed isn't always the same
>> as eviction but that's not the important point here). Right now we only
>> demote when we consider buffers for eviction. But we promote when we pin
>> buffers. Those two things aren't necessarily happening at the same rate
>> and in fact are often orders of magnitude different.
>
> I am absolutely, positively, violently in 100% agreement with this. I
> have made the same point before, but it sure is nice to hear someone
> else thinking about it the same way.

+1

>> What I'm saying is that we should demote a buffer every time we promote
>> a buffer. So every time we pin a buffer we should advance the clock a
>> corresponding amount. (I know I'm being intentionally vague about what
>> the corresponding amount is.) The important thing is that the two should
>> be tied together.
>
> Yes, absolutely. If you tilt your head the right way, my proposal of
> limiting the number of promotions per clock sweep has the effect of
> tying buffer demotion and buffer promotion together much more tightly
> than is the case right now. You are limited to 2 promotions per
> demotion; and practically speaking not all buffers eligible to be
> promoted will actually get accessed, so the number of promotions per
> demotion will in reality be somewhere between 0 and 2. Ideally it would
> be exactly 1, but 1 +/- 1 is still a tighter limit than we have at
> present. Which is not to say there isn't some other idea that is better
> still.

I think that would help, but it still leaves user backends trying to advance the clock, which is quite painful. Has anyone tested running the clock in the background? We need a wiki page with all the ideas that have been tested around buffer management...

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 10:07 AM, Robert Haas robertmh...@gmail.com wrote:
> On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila amit.kapil...@gmail.com wrote:
>> IIUC, this will allow us to increase the usage count only when the
>> buffer is touched by the clocksweep to decrement the usage count.
>
> Yes.
>
>> I think such a solution will be good for the cases when many evictions
>> need to be performed to satisfy the workload; OTOH when there are not
>> too many evictions that need to be done, some of the buffers that are
>> accessed much more will have an equal probability of getting evicted as
>> compared to buffers which are less accessed.
>
> Possibly, but I think it's even worse under the current algorithm. Under
> this proposal, if we go for a long time without any buffer evictions,
> every buffer's usage count will top out at 2 more than wherever it was
> after the last clock sweep. In the worst case, every buffer (or most of
> them) could end up with the same usage count. But under the status quo,
> they'll all go to 5, which is an even bigger loss of information, and
> which will make the first eviction much more expensive than if they are
> all pegged at 2 or 3.

Okay, got your point. On further thinking about this idea, it occurred to me that you are trying to reduce the clock-sweep time, which in turn will improve the chances of finding a usable buffer in less time under high pressure. If that is true, then I think we should have seen a similar gain with the bgreclaimer idea which I tried some time back, because that also reduced the time for clock-sweep.

> There could be ways to improve things further, such as by slowly
> advancing the clock sweep even when no eviction is required. But that's
> a bit tricky to do, too. You'd like (perhaps) to advance it one step for
> every buffer allocation, but you don't want a centralized counter, or
> unnecessary contention on the clock sweep when no eviction is necessary
> in the first place. There's probably some way to make it work, though,
> if we put our mind to it. I'm inclined to try the simpler approach first
> and see what it buys.

Sure, I think it is worth a try.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 12:37:44AM -0400, Robert Haas wrote:
>> I think such a solution will be good for the cases when many evictions
>> need to be performed to satisfy the workload; OTOH when there are not
>> too many evictions that need to be done, some of the buffers that are
>> accessed much more will have an equal probability of getting evicted as
>> compared to buffers which are less accessed.
>
> Possibly, but I think it's even worse under the current algorithm. Under
> this proposal, if we go for a long time without any buffer evictions,
> every buffer's usage count will top out at 2 more than wherever it was
> after the last clock sweep. In the worst case, every buffer (or most of
> them) could end up with the same usage count. But under the status quo,
> they'll all go to 5, which is an even bigger loss of information, and
> which will make the first eviction much more expensive than if they are
> all pegged at 2 or 3.

I've been following this thread from the side with interest and got twigged by the point about loss of information. If you'd like better information about relative ages, you can achieve this by raising the cap on the usage count and dividing (or right-shifting) each sweep. This would allow you to remember much more about the relative worth of often used pages. With a cap of 32 you'd have the same effect as now, where after 5 sweeps the buffer is evicted. Mathematically the count would converge to the number of times the block is used per sweep.

If you wanted to be really clever, you could at the beginning of each sweep take an estimate of the number of buffers used since the last sweep (from the stats collector perhaps) and use that to drive your divisor, so if you have lots of allocations you become more aggressive about reducing the counts. Or if the load is light, fall back to just subtracting one. Then you don't need a cap at all. (Apologies if this has been suggested before; Google didn't find anything for me.)
Have a nice day,
-- 
Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he
> does not attach much importance to his own thoughts.
-- Arthur Schopenhauer
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 5:26 AM, Robert Haas robertmh...@gmail.com wrote:
>> The way our cache works we promote when a buffer is accessed but we
>> only demote when a buffer is flushed. We flush a lot less often than we
>> touch buffers so it's not surprising that the cache ends up full of
>> buffers that are all in the most recently used section.
>
> This isn't really correct. We promote when it's accessed, but we demote
> it when the clock sweep hand passes over it, which happens each time we
> consider it for eviction. It does not have to do with flushing dirty
> data to disk, and it does not happen only when the buffer is actually
> evicted.

This is my point though (you're right that flushed isn't always the same as eviction but that's not the important point here). Right now we only demote when we consider buffers for eviction. But we promote when we pin buffers. Those two things aren't necessarily happening at the same rate and in fact are often orders of magnitude different.

So it makes sense that we end up with a lot of buffers promoted all the way to the most recently used ntile and then have to do n passes around the clock and have no good information about which buffer to evict.

What I'm saying is that we should demote a buffer every time we promote a buffer. So every time we pin a buffer we should advance the clock a corresponding amount. (I know I'm being intentionally vague about what the corresponding amount is.) The important thing is that the two should be tied together.

-- 
greg
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan p...@heroku.com wrote:
> Why is that good? We did discuss this before. I've recapped some of what
> I believe to be the most salient points below. I think that people were
> all too quick to dismiss the idea of a wall time interval playing some
> role here (at least as a defense against correlated references, as a
> correlated reference period). I suppose that that's because it doesn't
> fit with an intuition that says that that kind of interval ought to be
> derived algebraically - magic delay settings are considered suspect.

Yep, Tom gave that reason here:
http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us

But there was also this point from Andres - gettimeofday is not free:
http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de

And this point from me - this can degrade to random eviction under high pressure:
http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com

You'll notice that my proposal avoids all three of those objections.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 2:55 AM, Robert Haas robertmh...@gmail.com wrote:
> On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Merlin Moncure mmonc...@gmail.com writes:
>>> Anyways, I'm still curious if you can post similar numbers basing the
>>> throttling on gross allocation counts instead of time. Meaning: some
>>> number of buffer allocations has to have occurred before you consider
>>> eviction. Besides being faster I think it's a better implementation:
>>> an intermittently loaded server will give more consistent behavior.
>>
>> Yeah --- I think wall-clock-based throttling is fundamentally the wrong
>> thing anyway. Are we going to start needing a CPU speed measurement to
>> tune the algorithm with? Not the place to be going. But driving it off
>> the number of allocations that've been done could be sensible. (OTOH,
>> that means you need a central counter, which itself would be a
>> bottleneck.)
>
> So, I was thinking about this a little bit more today, prodded by my
> coworker John Gorman. I'm wondering if we could drive this off of the
> clock sweep; that is, every time the clock sweep touches a buffer, its
> usage count goes down by one, but we also set two flag bits. Every time
> the buffer gets touched thereafter, we check whether any flag bits are
> set; if so, we clear one and increase the usage count, else we do
> nothing. So the usage count can increase at most twice per clock sweep.
>
> The advantage of that is that, as with Peter's approach, it is harder
> for the usage count of a buffer to max out - to get there, you need
> sustained access over a longer period of time. But it's not
> time-dependent, so replaying the same workload at twice the speed or
> half the speed on faster or slower hardware doesn't change the choice of
> which buffer to evict, which seems good. And it will cause us to prefer
> buffers which are accessed repeatedly over a period of time rather than
> buffers that are accessed a bunch of times in quick succession and then
> not touched again for a while, which seems like a good bet.

IIUC, this will allow us to increase the usage count only when the buffer is touched by the clocksweep to decrement the usage count. I think such a solution will be good for the cases when many evictions need to be performed to satisfy the workload; OTOH when there are not too many evictions that need to be done, some of the buffers that are accessed much more will have an equal probability of getting evicted as compared to buffers which are less accessed.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 7:10 PM, Greg Stark st...@mit.edu wrote:
> The way the clock sweep algorithm is meant to be thought about is that
> it's an approximate lru. Each usage count corresponds to an ntile of the
> lru. So we don't know which buffer is least recently used but it must be
> in the set of buffers with usage count 0 and that should be 1/nth of all
> the buffers.

Agreed.

> In order for that property to be maintained though the usage count for
> some buffer should be getting decremented every time we touch a buffer.
> That is, every time we promote one buffer to the most recently moved
> ntile we should be demoting some other buffer.

Agreed. It's not easy to get this behavior exactly, though, because the buffer you kick out necessarily has a usage count of 0 and the one you bring in probably shouldn't. And we don't wanna have to run the clock sweep every time somebody touches a non-maximal usage count. But I think it is still true that this is, to some degree, what our algorithm is trying to approximate, and I also think it's pretty clear that our current approximation isn't that great.

> The way our cache works we promote when a buffer is accessed but we only
> demote when a buffer is flushed. We flush a lot less often than we touch
> buffers so it's not surprising that the cache ends up full of buffers
> that are all in the most recently used section.

This isn't really correct. We promote when it's accessed, but we demote it when the clock sweep hand passes over it, which happens each time we consider it for eviction. It does not have to do with flushing dirty data to disk, and it does not happen only when the buffer is actually evicted.

> Now it's complicated by the fact that we aren't promoting buffers
> directly to the most recently used ntile. We're incrementing the usage
> count by one. That makes it more of a least frequently used list rather
> than a lru. I think that's a mistake but I recall some debate about that
> when it first went in.

Note that the discussion of 2Q, LRU(k), and perhaps others asks not only how recently the page was used, but how frequently it was used. The recency of the next-to-last access is often used as a proxy for frequency. Consider two buffers. One gets touched 5 times in a row, once a day. The other gets touched 5 times per day, at equal intervals. In general, the second buffer is a better choice to retain than the first, even if it has been touched less recently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila amit.kapil...@gmail.com wrote:
> IIUC, this will allow us to increase the usage count only when the
> buffer is touched by the clocksweep to decrement the usage count.

Yes.

> I think such a solution will be good for the cases when many evictions
> need to be performed to satisfy the workload; OTOH when there are not
> too many evictions that need to be done, some of the buffers that are
> accessed much more will have an equal probability of getting evicted as
> compared to buffers which are less accessed.

Possibly, but I think it's even worse under the current algorithm. Under this proposal, if we go for a long time without any buffer evictions, every buffer's usage count will top out at 2 more than wherever it was after the last clock sweep. In the worst case, every buffer (or most of them) could end up with the same usage count. But under the status quo, they'll all go to 5, which is an even bigger loss of information, and which will make the first eviction much more expensive than if they are all pegged at 2 or 3.

There could be ways to improve things further, such as by slowly advancing the clock sweep even when no eviction is required. But that's a bit tricky to do, too. You'd like (perhaps) to advance it one step for every buffer allocation, but you don't want a centralized counter, or unnecessary contention on the clock sweep when no eviction is necessary in the first place. There's probably some way to make it work, though, if we put our mind to it. I'm inclined to try the simpler approach first and see what it buys.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/14/15 5:22 PM, Peter Geoghegan wrote:
> As long as we're doing random brainstorming, I'd suggest looking at
> making clocksweep actually approximate LRU-K/LRU-2 (which, again, to be
> clear, my prototype did not do). The clocksweep could maintain
> statistics about the recency of the second-to-last access across all
> buffers, and discriminate against buffers according to what bucket of
> the population they fit into. Not sure how aggressively we'd penalize
> those buffers that had very old penultimate references (or credit those
> that had very recent penultimate references), or what the bucket
> partitioning scheme is, but that's probably where I'd take it next. For
> example, buffers with a penultimate reference that is more than a
> standard deviation below the mean would be doubly penalized (and maybe
> the opposite, for those buffers with penultimate accesses a stddev above
> the mean). If that didn't work so well, then I'd look into an ARC-style
> recency and frequency list (while remembering things about already
> evicted blocks, which LRU-K does not do, although that paper is from the
> early 1990s).

Along the lines of brainstorming... why do we even allow usage_count > 1? Clocksweep was used pretty successfully by at least FreeBSD, but they simply used a bit to indicate recently used. Anything that wasn't recently used moved from the active pool to the inactive pool (which tended to be far larger than the active pool with decent amounts of memory), and a small number of buffers were kept on the 'free' list by pulling them out of the inactive pool and writing them if they were dirty. All of this was done on an LRU basis.

Given how common it is for the vast bulk of shared_buffers in an install to be stuck at 5, I'd think the first thing we should try is a combination of greatly reducing the max for usage_count (maybe to 2 instead of 1, to simulate 2 pools), and running the clock sweep a lot more aggressively in a background process.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: So, I was thinking about this a little bit more today, prodded by my coworker John Gorman. I'm wondering if we could drive this off of the clock sweep; that is, every time the clock sweep touches a buffer, its usage count goes down by one, but we also set two flag bits. Every time the buffer gets touched thereafter, we check whether any flag bits are set; if so, we clear one and increase the usage count, else we do nothing. So the usage count can increase at most twice per clock sweep. The advantage of that is that, as with Peter's approach, it is harder for the usage count of a buffer to max out - to get there, you need sustained access over a longer period of time. But it's not time-dependent, so replaying the same workload at twice the speed or half the speed on faster or slower hardware doesn't change the choice of which buffer to evict, which seems good. Why is that good? My little prototype basically implemented the LRU-K approach to preventing correlated references (which tpc-b has some of, in a really obvious way - not sure how significant that is). This would have accidentally made it weight frequency better, but the prototype did not pretend to actually implement something like LRU-K (which is not the state of the art in any case), because it did not consider the recency of the second-to-last access of the buffer at all. The prototype's performance started off well, but regressed in later pgbench-collector iterations (due to the influence of VACUUM, I think). If I wanted to cheat, I could have only done one large 45 minute run, which would have made the prototype look much better still. And it will cause us to prefer buffers which are accessed repeatedly over a period of time rather than buffers that are accessed a bunch of times in quick succession and then not touched again for a while, which seems like a good bet. 
I think that people were all too quick to dismiss the idea of a wall time interval playing some role here (at least as a defense against correlated references, as a correlated reference period). I suppose that that's because it doesn't fit with an intuition that says that that kind of interval ought to be derived algebraically - magic delay settings are considered suspect. However, the same can be said of the Five-minute rule, which is a highly influential rule of thumb that Jim Gray came up with. The Five minute rule is now obsolete, but that took a long time, and the fundamental observation still applies (Wikipedia says it depends on what type of disks you have these days, but the fact remains that rule of thumbs like this can be more or less robust). I think that correlated reference type delays *are* used effectively in real world systems without it being fragile/overly workload dependent. I can't convince myself that this fully solves the (currently existing) problem of usage counts increasing too fast. In theory, every time the clock sweep hand advances, it decrements the usage count of one buffer by one, but also makes it possible for the usage count of that buffer to increase by up to two (or by only one if the buffer previously had a usage count of five). So with the right access pattern, it seems like you could still get to a place where a typical buffer eviction has to do a lot of scanning before it finds a victim buffer. As with the present algorithm, we'd be most likely to have that problem when the buffer hit ratio is very high, so that there are many opportunities for usage counts to go up between each opportunity to push usage counts back down. But it seems like it would at least limit the damage. I haven't tried this or anything, so this is just random brainstorming. That's what it'll take, I think -- random brainstorming, and crude prototypes to test theories. 
As long as we're doing random brainstorming, I'd suggest looking at making clocksweep actually approximate LRU-K/LRU-2 (which, again, to be clear, my prototype did not do). The clocksweep could maintain statistics about the recency of the second-to-last access across all buffers, and discriminate against buffers according to what bucket of the population they fit into. Not sure how aggressively we'd penalize those buffers that had very old penultimate references (or credit those that had very recent penultimate references), or what the bucket partitioning scheme is, but that's probably where I'd take it next. For example, buffers with a penultimate reference that is more than a standard deviation below the mean would be double penalized (and maybe the opposite, for those buffers with penultimate accesses a stddev above the mean). If that didn't work so well, then I'd look into an ARC style recency and frequency list (while remembering things about already evicted blocks, which LRU-K does not do, although that paper is from the early 1990s). There are approaches to relieving lock
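The stddev-based bucketing idea above might look something like this rough Python sketch (all names are hypothetical and the per-sweep recomputation of the statistics is deliberately naive; real code would maintain them incrementally in the clock sweep):

```python
# Illustrative sketch of the LRU-2 bucketing idea: track each buffer's
# second-to-last access time, and give an extra usage-count decrement to
# buffers whose penultimate reference is more than one standard deviation
# older than the mean. Not PostgreSQL code.
import statistics

class BufferStats:
    def __init__(self):
        self.accesses = {}  # buf_id -> (last, second_to_last)

    def touch(self, buf_id, now):
        last, _ = self.accesses.get(buf_id, (None, None))
        self.accesses[buf_id] = (now, last)

    def sweep_penalty(self, buf_id):
        # Extra decrement for a stale penultimate reference.
        penults = [p for (_, p) in self.accesses.values() if p is not None]
        if len(penults) < 2:
            return 0
        mean = statistics.mean(penults)
        sd = statistics.stdev(penults)
        _, p = self.accesses[buf_id]
        if p is not None and sd > 0 and p < mean - sd:
            return 1   # double-penalize: the clock decrements by 2, not 1
        return 0

stats = BufferStats()
stats.touch("cold", 0)          # correlated pair of accesses, long ago
stats.touch("cold", 1)
for i in range(9):              # recently re-referenced buffers
    stats.touch(i, 100 + i)
    stats.touch(i, 200 + i)
print(stats.sweep_penalty("cold"))  # -> 1: stale penultimate access
```

The interesting property is that a buffer touched twice in quick succession long ago (a correlated reference) is discriminated against, while a buffer with two well-spaced recent accesses is not.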
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: Anyways, I'm still curious if you can post similar numbers basing the throttling on gross allocation counts instead of time. Meaning: some number of buffer allocations has to have occurred before you consider eviction. Besides being faster I think it's a better implementation: an intermittently loaded server will give more consistent behavior. Yeah --- I think wall-clock-based throttling is fundamentally the wrong thing anyway. Are we going to start needing a CPU speed measurement to tune the algorithm with? Not the place to be going. But driving it off the number of allocations that've been done could be sensible. (OTOH, that means you need a central counter, which itself would be a bottleneck.) So, I was thinking about this a little bit more today, prodded by my coworker John Gorman. I'm wondering if we could drive this off of the clock sweep; that is, every time the clock sweep touches a buffer, its usage count goes down by one, but we also set two flag bits. Every time the buffer gets touched thereafter, we check whether any flag bits are set; if so, we clear one and increase the usage count, else we do nothing. So the usage count can increase at most twice per clock sweep. The advantage of that is that, as with Peter's approach, it is harder for the usage count of a buffer to max out - to get there, you need sustained access over a longer period of time. But it's not time-dependent, so replaying the same workload at twice the speed or half the speed on faster or slower hardware doesn't change the choice of which buffer to evict, which seems good. And it will cause us to prefer buffers which are accessed repeatedly over a period of time rather than buffers that are accessed a bunch of times in quick succession and then not touched again for a while, which seems like a good bet. 
I can't convince myself that this fully solves the (currently existing) problem of usage counts increasing too fast. In theory, every time the clock sweep hand advances, it decrements the usage count of one buffer by one, but also makes it possible for the usage count of that buffer to increase by up to two (or by only one if the buffer previously had a usage count of five). So with the right access pattern, it seems like you could still get to a place where a typical buffer eviction has to do a lot of scanning before it finds a victim buffer. As with the present algorithm, we'd be most likely to have that problem when the buffer hit ratio is very high, so that there are many opportunities for usage counts to go up between each opportunity to push usage counts back down. But it seems like it would at least limit the damage. I haven't tried this or anything, so this is just random brainstorming. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
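As a toy model (not the actual bufmgr.c implementation; all names here are made up), the "usage count can increase at most twice per clock sweep" behavior can be simulated like this:

```python
# Rough simulation of the flag-bits proposal: touching a buffer raises its
# usage count only if a flag bit (re-armed by the clock sweep) is available,
# so the count can rise by at most two between sweep visits.

MAX_USAGE = 5

class Buffer:
    def __init__(self):
        self.usage_count = 0
        self.flag_bits = 0   # re-armed to 2 each time the sweep visits

    def sweep_touch(self):
        # Clock hand visits: decrement usage, set both flag bits.
        if self.usage_count > 0:
            self.usage_count -= 1
        self.flag_bits = 2

    def access(self):
        # Backend touches the buffer: bump usage only by consuming a flag bit.
        if self.flag_bits > 0:
            self.flag_bits -= 1
            self.usage_count = min(self.usage_count + 1, MAX_USAGE)

buf = Buffer()
buf.sweep_touch()
for _ in range(100):     # hammer the buffer between sweep passes
    buf.access()
print(buf.usage_count)   # -> 2: further accesses stop counting
```

A burst of accesses in quick succession is thus worth no more than two, while sustained access across many sweep cycles is needed to reach the maximum, which is the frequency-over-recency property the proposal is after.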
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
I've been meaning to write this since PGConf and now isn't a great time since I'm on my phone but I think it's time. The way the clock sweep algorithm is meant to be thought about is that it's an approximate LRU. Each usage count corresponds to an ntile of the LRU. So we don't know which buffer is least recently used but it must be in the set of buffers with usage count 0 and that should be 1/nth of all the buffers. In order for that property to be maintained though the usage count for some buffer should be getting decremented every time we touch a buffer. That is, every time we promote one buffer to the most recently used ntile we should be demoting some other buffer. The way our cache works we promote when a buffer is accessed but we only demote when a buffer is flushed. We flush a lot less often than we touch buffers so it's not surprising that the cache ends up full of buffers that are all in the most recently used section. Now it's complicated by the fact that we aren't promoting buffers directly to the most recently used ntile. We're incrementing the usage count by one. That makes it more of a least frequently used list rather than an LRU. I think that's a mistake but I recall some debate about that when it first went in.
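The promote-without-demote imbalance described above is easy to demonstrate with a toy model (illustrative Python, not PostgreSQL code): if every touch promotes but nothing demotes, all usage counts drift to the cap and the ntile property is lost.

```python
# Toy model: without pairing each promotion with a demotion, every buffer
# that is touched at all saturates at the maximum usage count, so the
# "approximate LRU" ntile property disappears.
import random

def run(n_buffers, n_touches, demote_on_promote):
    counts = [0] * n_buffers
    rng = random.Random(42)
    for _ in range(n_touches):
        b = rng.randrange(n_buffers)
        counts[b] = min(counts[b] + 1, 5)       # promote the touched buffer
        if demote_on_promote:
            v = rng.randrange(n_buffers)        # demote some other buffer
            if v != b and counts[v] > 0:
                counts[v] -= 1
    return counts

promote_only = run(100, 10_000, False)
balanced = run(100, 10_000, True)
print(sum(c == 5 for c in promote_only))   # nearly all buffers saturate
print(sum(c == 5 for c in balanced))       # counts stay spread across ntiles
```

With the balanced variant the counts stay distributed, so the usage-count-0 set remains a meaningful approximation of the least recently used ntile.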
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Jim Nasby j...@nasby.net wrote: In our case this could maybe be handled by simply not incrementing counts when there's no eviction... but I'm more a fan of separate pools/clocks, because that means you can do things like an LFU for active and an LRU for inactive. I have hesitated to mention some benchmarks I did for optimal caching techniques for a database load, because they were so old, but maybe the ideas might spark something of value in the discussion. I'm talking about early 1985 on 80286 hardware on DOS with a Terminate and Stay Resident (TSR) cache external to the database. The external cache used LRU caching, and I was looking at what caching I could do inside the database to improve real database workloads which tended to include both OLTP and reporting. I found two types of caches improved performance. Neither was a substitute for the LRU cache closer to the hardware, and eliminating either reduced performance over having both. One was index-specific -- each connection caused to be held in cache the last page at each level of the index. This proved useful because in our real life applications it turned out that the next random access on an index was very often the same as or near the previous. The other was a weighted average of access counts -- each access bumped a count and after a certain number of bumps all counts were reduced by 25%. This was accomplished by setting each count to the sum of its existing value shifted right by one and shifted right by two. I understand that with the much larger RAM caches available 30 years later there could be some problems with passing all the counts atomically without causing a noticeable pause, and the higher connection counts may cause more contention issues. But if those issues could be solved (or somehow dodged for a proof of concept benchmark) it might be interesting to see how that worked out. FWIW, I recall that we used a one byte counter for each page, running 0 to 255. 
I don't recall the number at which we effectively multiplied by 0.75, and there was nothing particularly magic about that multiplier other than it was pretty fast and worked better than 0.5 in my benchmarks. I also don't remember what we used as the initial value on a page load. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
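The shift-based decay Kevin describes works out like this (a small illustrative snippet, not the original 1985 code):

```python
# Periodic ~0.75 decay of a one-byte access counter using only two shifts
# and an add: new = (count >> 1) + (count >> 2), i.e. count/2 + count/4.

def decay(count: int) -> int:
    """Apply the roughly-0.75 weighting to a 0-255 access counter."""
    return (count >> 1) + (count >> 2)

print(decay(100))  # -> 75
print(decay(255))  # -> 190 (127 + 63, so the counter can never overflow)
print(decay(3))    # -> 1 (small counts decay toward zero quickly)
```

Note the integer truncation means the effective multiplier is slightly below 0.75 for counts that aren't multiples of 4, which is harmless for this purpose.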
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Kevin Grittner kgri...@ymail.com wrote: each connection caused to be held in cache the last page at each level of the index. Apologies for ambiguous terminology there. To be clear: the most recently accessed page at each level of the index. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Fri, Apr 18, 2014 at 11:46 AM, Greg Stark st...@mit.edu wrote: On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas robertmh...@gmail.com wrote: I am a bit confused by this remark. In *any* circumstance when you evict you're incurring precisely one page fault I/O when the page is read back in. That doesn't mean that the choice of which page to evict is irrelevant. But you might be evicting a page that will be needed soon or one that won't be needed for a while. If it's not needed for a while you might be able to avoid many page evictions by caching a page that will be used several times. Sure. If all the pages currently in RAM are hot -- meaning they're hot enough that they'll be needed again before the page you're reading in -- then they're all equally bad to evict. Also true. But the problem is that it is very rarely, if ever, the case that all pages are *equally* hot. On a pgbench workload, for example, I'm very confident that while there's not really any cold data, the btree roots and visibility map pages are a whole lot hotter than a randomly-selected heap page. If you evict a heap page, you're going to need it back pretty quick, because it won't be long until the random-number generator again chooses a key that happens to be located on that page. But if you evict the root of the btree index, you're going to need it back *immediately*, because the very next query, no matter what key it's looking for, is going to need that page. I'm pretty sure that's a significant difference. I'm trying to push us away from the gut instinct that frequently used pages are important to cache and towards actually counting how many i/os we're saving. In the extreme it's possible to simulate any cache algorithm on a recorded list of page requests and count how many page misses it generates to compare it with an optimal cache algorithm. There's another issue, which Simon clued me into a few years back: evicting the wrong page can cause system-wide stalls. 
In the pgbench case, evicting a heap page will force the next process that chooses a random number that maps to a tuple on that page to wait for the page to be faulted back in. That's sad, but unless the scale factor is small compared to the number of backends, there will probably be only ONE process waiting. On the other hand, if we evict the btree root, within a fraction of a second, EVERY process that isn't already waiting on some other I/O will be waiting for that I/O to complete. The impact on throughput is much bigger in that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby j...@nasby.net wrote: I feel that if there is no memory pressure, frankly it doesn't matter much about what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot i.e. high memory pressure workloads. Of course, I may be totally wrong here. Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting anything*. I don't think that's really true. A workload can fit within shared_buffers at some times and spill beyond it at others. Every time it fits within shared_buffers for even a short period of time, the reference count of any buffer that's not ice-cold goes to 5 and we essentially lose all knowledge of which buffers are relatively hotter. Then, when we spill out again, evictions are random. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 28, 2014 at 6:02 AM, Robert Haas robertmh...@gmail.com wrote: Also true. But the problem is that it is very rarely, if ever, the case that all pages are *equally* hot. On a pgbench workload, for example, I'm very confident that while there's not really any cold data, the btree roots and visibility map pages are a whole lot hotter than a randomly-selected heap page. If you evict a heap page, you're going to need it back pretty quick, because it won't be long until the random-number generator again chooses a key that happens to be located on that page. But if you evict the root of the btree index, you're going to need it back *immediately*, because the very next query, no matter what key it's looking for, is going to need that page. I'm pretty sure that's a significant difference. I emphasized leaf pages because even with master the root and inner pages are still going to be so hot as to make them constantly in cache, at least with pgbench's use of a uniform distribution. You'd have to have an absolutely enormous scale factor before this might not be the case. As such, I'm not all that worried about inner pages when performing these simple benchmarks. However, in the case of the pgbench_accounts table, each of the B-Tree leaf pages that comprise about 99.5% of the total is still going to be about six times more frequently accessed than each heap page. That's a small enough difference for it to easily go unappreciated, and yet a big enough difference for it to hurt a lot. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Fri, Apr 25, 2014 at 10:45 AM, Peter Geoghegan p...@heroku.com wrote: I've now done a non-limited comparative benchmark of master against the patch (once again, with usage_count starting at 6, and BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again, the distribution threshold used was consistently 5.0, causing the patched pgbench to report for each test: transaction type: Custom query scaling factor: 5000 standard deviation threshold: 5.0 access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741 Results are available from: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/ I updated this with various changes in bgwriter configuration. Perhaps unsurprisingly, disabling the background writer entirely helps for both master and patched. It is perhaps notable that the largest difference between two comparable patch + master test runs is seen when the background writer is disabled entirely, and 32 clients. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/28/14, 8:04 AM, Robert Haas wrote: On Mon, Apr 21, 2014 at 6:38 PM, Jim Nasby j...@nasby.net wrote: I feel that if there is no memory pressure, frankly it doesn't matter much about what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot i.e. high memory pressure workloads. Of course, I may be totally wrong here. Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting anything*. I don't think that's really true. A workload can fit within shared_buffers at some times and spill beyond it at others. Every time it fits within shared_buffers for even a short period of time, the reference count of any buffer that's not ice-cold goes to 5 and we essentially lose all knowledge of which buffers are relatively hotter. Then, when we spill out again, evictions are random. That's a separate problem, but yes, just because we're not evicting something doesn't mean we can end up with every buffer marked as equally important. OSes handle this by splitting pages between active and inactive, and maintaining a relative balance between the two (actually a bit more complex because there's a separate inactive/dirty pool). In our case this could maybe be handled by simply not incrementing counts when there's no eviction... but I'm more a fan of separate pools/clocks, because that means you can do things like an LFU for active and an LRU for inactive. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
I've now done a non-limited comparative benchmark of master against the patch (once again, with usage_count starting at 6, and BM_MAX_USAGE_COUNT at 30) with a Gaussian distribution. Once again, the distribution threshold used was consistently 5.0, causing the patched pgbench to report for each test: transaction type: Custom query scaling factor: 5000 standard deviation threshold: 5.0 access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741 Results are available from: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-gauss/ -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 21, 2014 at 11:57 PM, Peter Geoghegan p...@heroku.com wrote: Here is a benchmark that is similar to my earlier one, but with a rate limit of 125 tps, to help us better characterize how the prototype patch helps performance: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/ I've added some more test sets to this result report, again with a 125 TPS limit, but on this occasion with a pgbench Gaussian distribution. I used V13 of the recently proposed Gaussian distribution pgbench patch [1] to accomplish this, including the Gaussian variant of tpc-b that the pgbench patch has baked in. The distribution threshold used was consistently 5, causing the patched pgbench to report for each test: transaction type: Custom query scaling factor: 5000 standard deviation threshold: 5.0 access probability of top 20%, 10% and 5% records: 0.68269 0.38293 0.19741 It looks like the patch continues to have much lower latency than master for this somewhat distinct workload. Actually, even though the background writer is working somewhat harder than in the uniform distribution case, the average latency with patched is appreciably lower. Total buffers allocated are just as consistent as before for patched, but the number is markedly lower than for the prior uniform distribution case. Dirty memory graphs start off similar to the uniform case with patched, but get a bit spikier towards the end of each test run there. It's still *markedly* better than master for either distribution type, which is still really aggressive at times for master, and other times by far isn't aggressive enough, in much the same way as before. In general, with the Gaussian distribution, average latency is lower, but worst case is higher. The patch maintains its clear lead for average case, albeit a smaller lead than with uniform, and with worst case things are much better relatively speaking. 
Absolute worst case (and not worst case averaged across client counts) is 1.4 seconds with patched, to 8.3 with master...and that terrible worst case happens *twice* with master. For uniform distribution, the same figure was 5.4 - 5.8 seconds for master, and 0.6 seconds for patched. What is curious is that with master and with the Gaussian distribution, I see a distinct latency no man's land in multiple test runs, like this one here for example: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/49/index.html . It looks like there is a clear differentiation between going to disk and not going to disk, or something like that. I don't see this for any other case, and it's quite obviously a consistent and distinct feature of master + Gaussian when the OS isn't aggressively writing out a mountain of dirty memory. This is something that I personally have never seen before. I also note that master had 3 huge background writer spikes with a Gaussian distribution, rather than 2 and 1 small one, as was (consistently) demonstrated to happen with a uniform distribution. What's more, 90th percentile latency is very consistent across client counts for the new patched test run, as opposed to being very much higher with higher client counts when master is tested. [1] http://www.postgresql.org/message-id/alpine.DEB.2.10.1404011107220.2557@sto -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Here is a benchmark that is similar to my earlier one, but with a rate limit of 125 tps, to help us better characterize how the prototype patch helps performance: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/ Again, these are 15 minute runs with unlogged tables at multiple client counts, and scale 5,000. Every test run should have managed to hit that limit, but in the case of master 1 test did not. I have included vmstat, iostat and meminfo OS instrumentation this time around, which is really interesting for this particular limit based benchmark. The prototype patch tested here is a slight refinement on my earlier prototype. Apart from reducing the number of gettimeofday() calls, and doing them out of the critical path, I increased the initial usage_count value to 6. I also set BM_MAX_USAGE_COUNT to 30. I guess I couldn't resist the temptation to tweak things, which I actually did very little of prior to publishing my initial results. This helps, but there isn't a huge additional benefit. The benchmark results show that master cannot even meet the 125 tps limit on 1 test out of 9. More interestingly, the background writer consistently cleans about 10,000 buffers per test run when testing the patch. At the same time, buffers are allocated at a very consistent rate of around 262,000 for the patched test runs. Leaving aside the first test run, with the patch there is only a tiny variance in the number cleaned between each test, a variance of just a few hundred buffers. In contrast, master has enormous variance. During just over half of the tests, the background writer does not clean even a single buffer. Then, on 2 tests out of 9, it cleans an enormous ~350,000 buffers. The second time this happens leads to master failing to even meet the 125 tps limit (albeit with only one client). If you drill down to individual test runs, a similar pattern is evident. 
You'll now find operating system information (meminfo dirty memory) graphed here. The majority of the time, master does not hold more than 1,000 kB of dirty memory at a time. Once or twice it's 0 kB for multiple minutes. However, during the test runs where we also see huge spikes in background-writer-cleaned pages, we also see huge spikes in the total amount of dirty memory (and correlated huge spikes in latency). It can get to highs of ~700,000 kB at one point. In contrast, the patched tests show very consistent amounts of dirty memory. Per test, it almost always tops out at 4,000 kB - 6,000 kB (there is a single 12,000 kB spike, though). There is consistently a distinct zig-zag pattern to the dirty memory graph with the patched tests, regardless of client count or where checkpoints occur. Master shows mountains and valleys for those two particularly problematic tests, correlating with a panicked background writer's aggressive feedback loop. Master also shows less rhythmic zig-zag patterns that only peak at about 600 kB - 1,000 kB for the entire duration of many individual test runs. Perhaps most notably, average and worst case latency is far improved with the patch. On average it's less than half of master with 1 client, and less than a quarter of master with 32 clients. I think that the rate limiting feature of pgbench is really useful for characterizing how work like this improves performance. I see a far smoother and more consistent pattern of I/O that superficially looks like Postgres is cooperating with the operating system much more than it does in the baseline. It sort of makes sense that the operating system cache doesn't care about frequency while Postgres does. If the OS cache did weigh frequency, it would surely not alter the outcome very much, since OS cached data has presumably not been accessed very frequently recently. I suspect that I've cut down on double buffering by quite a bit. 
I would like to come up with a simple way of measuring that, using something like pgfincore, but the available interfaces don't seem well-suited to quantifying how much of a problem this is and remains. I guess calling pgfincore on the index segment files might be interesting, since shared_buffers mostly holds index pages. This has been verified using pg_buffercache. It would be great to test this out with something involving non-uniform distributions, like Gaussian and Zipfian distributions. The LRU-K paper tests Zipfian too. The uniform distribution pgbench uses here, while interesting, doesn't tell the full story at all and is less representative of reality (TPC-C is formally required to have a non-uniform distribution [1] for some things, for example). A Gaussian distribution might show essentially the same failure to properly credit pages with frequency of access in one additional dimension, so to speak. Someone should probably look at the TPC-C-like DBT-2, since in the past that was considered to be a problematic workload for PostgreSQL [2] due to the heavy I/O. Zipfian is a lot less sympathetic than uniform if the
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Jason Petersen wrote: Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype. The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system. Would it make sense to just cache the result of the latest gettimeofday() call and use that as an approximation for wall time? The busier the system is, the more accurate that should be. Yours, Laurenz Albe
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 22, 2014 at 12:59 PM, Albe Laurenz laurenz.a...@wien.gv.at wrote: Jason Petersen wrote: Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype. The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system. Would it make sense to just cache the result of the latest gettimeofday() call and use that as an approximation for wall time? The busier the system is, the more accurate that should be. That sounds...risky. How will the invalidation/updating of the cache work? How will we track the time window in which the cached value is still valid and applicable? My first thoughts only. I may be missing the point though. Regards, Atri -- Regards, Atri *l'apprenant*
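The mechanics of a cached wall clock are simpler than the invalidation question suggests: the cached value is never "invalid", just bounded-stale. A hypothetical sketch (illustrative Python, not PostgreSQL code; in Postgres the updater would presumably be an auxiliary process writing to shared memory rather than a thread):

```python
# Hypothetical sketch of the "cached gettimeofday" idea: one background
# task refreshes a shared coarse timestamp, and hot paths read it instead
# of making their own timer syscalls. Staleness is bounded by the update
# interval, so no explicit invalidation is needed.
import threading
import time

class CoarseClock:
    def __init__(self, interval=0.01):
        self._now = time.time()          # valid immediately
        self._interval = interval
        t = threading.Thread(target=self._tick, daemon=True)
        t.start()

    def _tick(self):
        while True:
            self._now = time.time()      # atomic attribute store in CPython
            time.sleep(self._interval)

    def now(self):
        return self._now                 # cheap read, no syscall on the hot path

clock = CoarseClock()
time.sleep(0.05)
print(abs(clock.now() - time.time()) < 0.5)  # True: stale by at most ~one interval
```

The trade-off is exactly the one Atri raises, reframed: you never know the precise moment, but you always know the error bound, which is enough for a correlated-reference delay measured in seconds.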
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 04/17/2014 10:39 PM, Andres Freund wrote: On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote: Just over 99.6% of pages (leaving aside the meta page) in the big 10 GB pgbench_accounts_pkey index are leaf pages. What is the depth of the B-Tree at this percentage? Cheers Hannu
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/21/14, 6:07 PM, David G Johnston wrote: Jim Nasby-2 wrote: I feel that if there is no memory pressure, frankly it doesn't matter much about what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot i.e. high memory pressure workloads. Of course, I may be totally wrong here. Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting anything*. The trade-off I'm seeing here is between measuring when there is no memory pressure - and thus eating at performance while not actually evicting buffers - and not measuring but then encountering memory pressure and not having a clue as to what should be evicted. Right. OSes handle this by keeping a certain ratio of active vs inactive pages, regardless of pressure for free pages. That way when you need more pages in the free list you can pull them from the inactive list knowing that you're making a good decision. One of the really nice things about this approach is that if memory pressure is low enough that you don't need more pages on the inactive list you don't even need to run that clock. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
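The OS-style active/inactive split described above can be sketched as follows (illustrative Python with hypothetical names, not a PostgreSQL patch; real kernels use clocks rather than ordered dicts, but the promotion/demotion/eviction flow is the same):

```python
# Two-pool cache sketch: new pages enter the inactive pool; a second
# reference promotes a page to the active pool, demoting the oldest
# active page to keep the ratio; evictions come only from the cold end
# of the inactive pool.
from collections import OrderedDict

class TwoPoolCache:
    def __init__(self, capacity, active_ratio=0.5):
        self.capacity = capacity
        self.active_max = int(capacity * active_ratio)
        self.active = OrderedDict()     # re-referenced pages (MRU at the end)
        self.inactive = OrderedDict()   # eviction candidates (LRU order)

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            # Second reference promotes; demote an active page to rebalance.
            del self.inactive[page]
            self.active[page] = True
            if len(self.active) > self.active_max:
                demoted, _ = self.active.popitem(last=False)
                self.inactive[demoted] = True
        else:
            self.inactive[page] = True
            if len(self.active) + len(self.inactive) > self.capacity:
                self.inactive.popitem(last=False)   # evict coldest inactive page

cache = TwoPoolCache(4)
for p in ["root", "leaf1", "root", "leaf2", "leaf3", "leaf4"]:
    cache.touch(p)
print("root" in cache.active)   # True: the re-referenced page was promoted
```

Note the property Jim highlights: when there is no pressure (no insertions forcing evictions), the inactive pool never needs scanning at all.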
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 22, 2014 at 2:03 AM, Hannu Krosing ha...@krosing.net wrote: What is the depth of the B-Tree at this percentage? Well, this percentage of B-Tree pages that are leaf pages doesn't have much to do with the depth. The percentage seems very consistent for each B-Tree, irrespective of the total size of the B-Tree. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/15/14, 1:15 PM, Peter Geoghegan wrote: On Tue, Apr 15, 2014 at 9:30 AM, Merlin Moncure mmonc...@gmail.com wrote: There are many reports of improvement from lowering shared_buffers. The problem is that it tends to show up on complex production workloads and that there is no clear evidence pointing to problems with the clock sweep; it could be higher up in the partition locks or something else entirely (like the O/S). pgbench is also not the greatest tool for sniffing out these cases: it's too random and for large database optimization is generally an exercise in de-randomizing i/o patterns. We really, really need a broader testing suite that covers more usage patterns. I find it quite dissatisfying that we know so little about this. This is an area where additional stats gathering would be very valuable. We're running 8G buffers on 512G servers because bumping it up hurt performance, but we have no clue why. Was it due to buffer pins? How many times the clock had to sweep to find a victim? Something else entirely? No idea... :( -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/18/14, 2:51 PM, Atri Sharma wrote: I feel that if there is no memory pressure, frankly it doesn't matter much what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot, i.e. high memory pressure workloads. Of course, I may be totally wrong here. Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting anything*. One thing that I discussed with Merlin offline and am now concerned about is how the actual eviction will work. We cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the oldest one. This is why OSes use multiple page pools. If we're going to use a clock sweep at all I think we need to use the same. Every time we discuss this stuff it feels like we're completely reinventing the wheel that was solved by OSes years ago. :( -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/16/14, 10:28 AM, Robert Haas wrote: Also, I think the scalability problems around buffer eviction are eminently solvable, and in particular I'm hopeful that Amit is going to succeed in solving them. Suppose we have a background process (whether the background writer or some other) that runs the clock sweep, identifies good candidates for eviction, and pushes them on a set of, say, 16 free-lists protected by spinlocks. (The optimal number of free-lists probably depends on the size of shared_buffers.) How *certain* are we that a single freelist lock (that actually ONLY protects the freelist) would be that big a deal? I suspect it wouldn't be much of an issue at all: - Right now (IIRC) it's tied into the clock as well, so immediate fail on scaling... - The clock is WAY more expensive than grabbing one buffer off the free list. Last I looked it was so bad that even if the next buffer the clock hit was free it was still worse than hitting the free list. I strongly suspect that a single freelist lock (that didn't protect anything else) would be fine. I think it'd be folly to start with a more complex multi-lock/multi-freelist implementation before we knew we needed one. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
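The multi-freelist scheme Robert describes in the quoted text can be sketched in miniature. This is a hypothetical illustration only - the class and method names are mine, not PostgreSQL's, and real backends would use spinlocks in shared memory rather than Python locks:

```python
import threading

class PartitionedFreelist:
    """Sketch of the quoted idea: N free-lists, each under its own lock,
    filled by a background sweeper and drained by backends. Hypothetical
    structure, not PostgreSQL code."""

    def __init__(self, n_partitions=16):
        self.lists = [[] for _ in range(n_partitions)]
        self.locks = [threading.Lock() for _ in range(n_partitions)]
        self.n = n_partitions

    def push(self, buf_id):
        # The background process pushes eviction candidates, spread by buffer id
        # so concurrent pushes mostly touch different locks.
        i = buf_id % self.n
        with self.locks[i]:
            self.lists[i].append(buf_id)

    def pop(self, hint):
        # A backend tries its "home" partition first, then steals from others.
        for k in range(self.n):
            i = (hint + k) % self.n
            with self.locks[i]:
                if self.lists[i]:
                    return self.lists[i].pop()
        return None  # all empty: caller would have to run the clock sweep itself
```

The contrast with Jim's position is visible in the structure: with `n_partitions=1` this degenerates to the single-lock freelist he argues would be sufficient.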
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Jim Nasby-2 wrote I feel that if there is no memory pressure, frankly it doesn't matter much what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot, i.e. high memory pressure workloads. Of course, I may be totally wrong here. Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting anything*. The trade-off I'm seeing here is between measuring when there is no memory pressure - and thus eating at performance while not actually evicting buffers - and not measuring but then encountering memory pressure and not having a clue as to what should be evicted. David J.
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 21, 2014 at 8:07 PM, David G Johnston david.g.johns...@gmail.com wrote: Jim Nasby-2 wrote I feel that if there is no memory pressure, frankly it doesn't matter much what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot, i.e. high memory pressure workloads. Of course, I may be totally wrong here. Well, there's either memory pressure or there isn't. If there isn't then it's all moot *because we're not evicting anything*. The trade-off I'm seeing here is between measuring when there is no memory pressure - and thus eating at performance while not actually evicting buffers - and not measuring but then encountering memory pressure and not having a clue as to what should be evicted. I believe that for the intended use discussed in this thread, a compile-time switch would be more than enough control, and it would avoid that tradeoff.
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Jim Nasby j...@nasby.net writes: How *certain* are we that a single freelist lock (that actually ONLY protects the freelist) would be that big a deal? We used to have one. It was a big bottleneck --- and this was years ago, when the buffer manager was much less scalable than it is today. (IIRC, getting rid of a central lock was one of the main advantages of the current clock sweep code over its predecessor.) The real issue here is that in the modern code, we hardly ever actually have anything in the freelist: only when a relation is dropped, or something like that, do buffers ever get put to the freelist. So your argument that removing a buffer from the freelist is cheaper than running the clock sweep is largely missing the point. We'd have to run a clock sweep in order to find something to put in the freelist. regards, tom lane
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: We used to have one. It was a big bottleneck --- and this was years ago, when the buffer manager was much less scalable than it is today. (IIRC, getting rid of a central lock was one of the main advantages of the current clock sweep code over its predecessor.) Yes, it was. This is a major advantage of clock sweep, and anything that replaces it will need to maintain the same advantage. Didn't someone indicate that clock sweep could beat ARC around that time, presumably for this reason? If no one did, then my reading of a variety of other papers on caching indicates that this is probably the case. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Peter Geoghegan p...@heroku.com writes: On Mon, Apr 21, 2014 at 5:28 PM, Tom Lane t...@sss.pgh.pa.us wrote: We used to have one. It was a big bottleneck --- and this was years ago, when the buffer manager was much less scalable than it is today. (IIRC, getting rid of a central lock was one of the main advantages of the current clock sweep code over its predecessor.) Yes, it was. This is a major advantage of clock sweep, and anything that replaces it will need to maintain the same advantage. Didn't someone indicate that clock sweep could beat ARC around that time, presumably for this reason? If no one did, then my reading of a variety of other papers on caching indicates that this is probably the case. ARC *was* the predecessor algorithm. See commit 5d5087363. regards, tom lane
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: ARC *was* the predecessor algorithm. See commit 5d5087363. I believe that the main impetus for replacing ARC with clock sweep came from patent issues, though. It was a happy coincidence that clock sweep happened to be better than ARC, but that doesn't mean that ARC didn't have some clear advantages, even if it wasn't worth it on balance. LRU-K and 2Q have roughly the same advantages. I'm reasonably confident you can have the best of both worlds, or something closer to it. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Peter Geoghegan p...@heroku.com writes: On Mon, Apr 21, 2014 at 5:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: ARC *was* the predecessor algorithm. See commit 5d5087363. I believe that the main impetus for replacing ARC with clock sweep came from patent issues, though. That was one issue, but performance gains were a large part of it too, and the main reason why we picked clock sweep rather than something else. Did you read the commit message I pointed to? (See also 4e8af8d27.) regards, tom lane
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 21, 2014 at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote: Did you read the commit message I pointed to? Yes. (See also 4e8af8d27.) Oh, I wasn't actually aware of the fact that 2Q made it into the tree. I thought that the first commit message you referred to just referenced on-list discussion of 2Q. Interesting. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 21, 2014 at 5:59 PM, Peter Geoghegan p...@heroku.com wrote: LRU-K and 2Q have roughly the same advantages. I'm reasonably confident you can have the best of both worlds, or something closer to it. Having said that, a big part of what I'd like to accomplish here is to address the more general problem of correlated references. That's probably something that has independent value. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Sat, Apr 19, 2014 at 3:37 AM, Bruce Momjian br...@momjian.us wrote: One thing that I discussed with Merlin offline and am now concerned about is how the actual eviction will work. We cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the oldest one. I thought if there was memory pressure the clock sweep would run and we wouldn't have everything at the max counter access value. Hmm, I see your point. With that applicable as well, I feel that the clocksweep counting/logical clock system shall be useful when deciding between multiple candidates for eviction. At worst, it can serve to replace the gettimeofday() calls. One thing I have thought of, with ideas and inputs from Joshua Yanowski offline, is that we can probably have a max-heap keyed on the logical clock age of buffers. Each time the clocksweep sees a buffer whose refcount has become zero, it will push the buffer into the max-heap. This can be a new representation of the freelist or an additional data structure. This still does not solve the problem of the clocksweep seeing the entire list, even if the max-heap makes the eviction process itself cheap. I am working on a PoC patch but am stuck on this point. My current approach sees the entire shared buffers list to search for any candidate buffers. Another pain point here is the concurrency and locking overhead of introducing a new data structure. Can the existing buffer header spinlock handle this problem, or does it hit the granularity of the spinlock too much? I see some blockers for this idea still. Nevertheless, the idea of clocksweep counts as logical clocks seems promising, at least intuitively. Thoughts and comments? Regards, Atri -- Regards, Atri *l'apprenant*
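The max-heap Atri describes might look like this in miniature. This is purely a sketch of the data structure under the stated assumptions (candidates keyed by sweeps survived); it deliberately ignores the locking and whole-list-scan problems he raises:

```python
import heapq

class EvictionHeap:
    """Hypothetical sketch: buffers whose refcount drops to zero are pushed
    on a heap ordered by logical age, so eviction pops the oldest candidate
    in O(log n) instead of rescanning shared buffers."""

    def __init__(self):
        self.heap = []  # entries: (-age, buf_id); Python's heapq is a
                        # min-heap, so negating the age makes the oldest pop first

    def add_candidate(self, buf_id, sweeps_survived):
        heapq.heappush(self.heap, (-sweeps_survived, buf_id))

    def evict(self):
        if not self.heap:
            return None
        _, buf_id = heapq.heappop(self.heap)
        return buf_id
```

One design wrinkle the sketch glosses over: a buffer already in the heap can be pinned again before eviction, so real code would have to re-check the refcount on pop or support removal, which is exactly where the concurrency cost he worries about comes in.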
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Fri, Apr 18, 2014 at 7:27 AM, Peter Geoghegan p...@heroku.com wrote: One way I have in mind for the eviction policy is to introduce an ageing factor in each buffer and take that factor into consideration when evicting a buffer. Consider a case where a table is pretty huge and spread across multiple pages. The querying pattern is like a time series pattern, i.e. a set of rows is queried pretty frequently for some time, making the corresponding page hot. Then the next set of rows is queried frequently, making that page hot, and so on. Consider a new page entering the shared buffers with refcount 1 and usage_count 1. If that page is part of the workload described above, it is likely that it shall not be used for a considerable amount of time after it has entered the buffers but will be used eventually. Now, the current hypothetical situation is that we have three pages: 1) The page that used to be hot at the previous time window but is no longer hot and is actually the correct candidate for eviction. 2) The current hot page (it won't be evicted anyway for now). 3) The new page which just got in and should not be evicted, since it can be hot soon (for this workload it will be hot in the next time window). When the clocksweep algorithm runs the next time, it will see the new buffer page as the one to be evicted (since page (1) may still have usage_count > 0, i.e. it may be 'cooling' but not yet 'cool'). This can be changed by introducing an ageing factor that sees how much time the current buffer has spent in shared buffers. If the time that the buffer has spent is large enough (relatively) and it is not hot currently, that means it has had its chance and can be evicted. This shall save the new page (3) from being evicted, since its time in shared buffers shall not be high enough to mandate eviction and it shall be given more chances. 
Since gettimeofday() is an expensive call and hence cannot be done in the tight loop, we can count the number of clocksweeps the current buffer has seen (rather, survived). This shall give us a rough estimate of the relative age of the buffer. When an eviction happens, all the candidates with refcount = 0 shall be taken. Then, among them, the one with the highest ageing factor shall be evicted. Of course, there may be better ways of doing the same, but I want to highlight the point (or possibility) of introducing an ageing factor to prevent eviction of relatively younger pages early in the eviction process. The overhead isn't too big. We just need to add another attribute in the buffer header for the number of clocksweeps seen (rather, survived) and check it when an eviction is taking place. The existing spinlock for buffer headers shall be good for protecting access; the access rules can be similar to those of usage_count. Thoughts and comments? Regards, Atri -- Regards, Atri *l'apprenant*
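The "sweeps survived" attribute can be made concrete with a toy model. This is an entirely hypothetical sketch of the proposal, not PostgreSQL's buffer manager: each pass of the hand bumps a counter on every buffer it inspects, and among zero-usage candidates the longest survivor is preferred as the victim:

```python
# Toy model of a clock sweep carrying the proposed "sweeps survived" field.
# Assumptions are mine: one full revolution per call, and eviction chooses
# among zero-usage buffers by greatest logical age.

class Buffer:
    def __init__(self, buf_id):
        self.id = buf_id
        self.usage_count = 0
        self.sweeps_seen = 0   # logical age: clock passes survived

def run_sweep(buffers, hand):
    """Advance the hand one full revolution, decrementing non-zero usage
    counts and collecting zero-usage candidates, then return the candidate
    that has survived the most sweeps (or None if all buffers are in use)."""
    candidates = []
    n = len(buffers)
    for i in range(n):
        buf = buffers[(hand + i) % n]
        if buf.usage_count == 0:
            candidates.append(buf)
        else:
            buf.usage_count -= 1
        buf.sweeps_seen += 1
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.sweeps_seen)
```

In Atri's three-page scenario, page (1) accumulates a high `sweeps_seen` while the newly loaded page (3) has a low one, so (1) is chosen even when both have reached usage_count 0.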
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 5:00 PM, Greg Stark st...@mit.edu wrote: On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas robertmh...@gmail.com wrote: Because all the usage counts are the same, the eviction at this point is completely indiscriminate. We're just as likely to kick out a btree root page or a visibility map page as we are to kick out a random heap page, even though the former have probably been accessed several orders of magnitude more often. That's clearly bad. That's not clear at all. In that circumstance regardless of what page you evict you're incurring precisely one page fault i/o when the page is read back in. I am a bit confused by this remark. In *any* circumstance when you evict you're incurring precisely one page fault I/O when the page is read back in. That doesn't mean that the choice of which page to evict is irrelevant. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas robertmh...@gmail.com wrote: I am a bit confused by this remark. In *any* circumstance when you evict you're incurring precisely one page fault I/O when the page is read back in. That doesn't mean that the choice of which page to evict is irrelevant. But you might be evicting a page that will be needed soon or one that won't be needed for a while. If it's not needed for a while you might be able to avoid many page evictions by caching a page that will be used several times. If all the pages currently in RAM are hot -- meaning they're hot enough that they'll be needed again before the page you're reading in -- then they're all equally bad to evict. I'm trying to push us away from the gut instinct that frequently used pages are important to cache and towards actually counting how many i/os we're saving. In the extreme it's possible to simulate any cache algorithm on a recorded list of page requests and count how many page misses it generates to compare it with an optimal cache algorithm. -- greg
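Greg's suggestion of replaying a recorded page-request trace and counting misses can be tried with a toy simulator. The two policies below are illustrative choices of mine, not PostgreSQL code: FIFO stands in for a simple replacement policy, and Belady's optimal (evict the resident page whose next use is farthest in the future) gives the lower bound to compare against:

```python
# Replay a trace of page requests through a cache policy and count misses.

def misses_fifo(trace, size):
    """Simple FIFO replacement, standing in for an arbitrary real policy."""
    cache, misses = [], 0
    for page in trace:
        if page not in cache:
            misses += 1
            if len(cache) >= size:
                cache.pop(0)          # evict the oldest resident page
            cache.append(page)
    return misses

def misses_optimal(trace, size):
    """Belady's optimal: evict the page whose next use is farthest away."""
    cache, misses = set(), 0
    for i, page in enumerate(trace):
        if page not in cache:
            misses += 1
            if len(cache) >= size:
                future = trace[i + 1:]
                # pages never used again sort past every page that is
                victim = max(cache, key=lambda p: future.index(p)
                             if p in future else len(future) + 1)
                cache.discard(victim)
            cache.add(page)
    return misses
```

On any trace, the optimal count is a floor: a candidate replacement policy can be judged by how close its miss count comes to it, which is the comparison Greg is proposing.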
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote: This can be changed by introducing an ageing factor that sees how much time the current buffer has spent in shared buffers. If the time that the buffer has spent is large enough (relatively) and it is not hot currently, that means it has had its chance and can be evicted. This shall save the new page (3) from being evicted since its time in shared buffers shall not be high enough to mandate eviction and it shall be given more chances. Since gettimeofday() is an expensive call and hence cannot be done in the tight loop, we can count the number of clocksweeps the current buffer has seen (rather, survived). This shall give us a rough estimate of the relative age of the buffer. Counting clock sweeps is an interesting idea. I think one concern was tracking hot buffers in cases where there is no memory pressure, and hence the clock sweep isn't running --- I am not sure how this would help in that case. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Sat, Apr 19, 2014 at 1:07 AM, Bruce Momjian br...@momjian.us wrote: On Fri, Apr 18, 2014 at 04:46:31PM +0530, Atri Sharma wrote: This can be changed by introducing an ageing factor that sees how much time the current buffer has spent in shared buffers. If the time that the buffer has spent is large enough (relatively) and it is not hot currently, that means it has had its chance and can be evicted. This shall save the new page (3) from being evicted since its time in shared buffers shall not be high enough to mandate eviction and it shall be given more chances. Since gettimeofday() is an expensive call and hence cannot be done in the tight loop, we can count the number of clocksweeps the current buffer has seen (rather, survived). This shall give us a rough estimate of the relative age of the buffer. Counting clock sweeps is an interesting idea. I think one concern was tracking hot buffers in cases where there is no memory pressure, and hence the clock sweep isn't running --- I am not sure how this would help in that case. I feel that if there is no memory pressure, frankly it doesn't matter much what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot, i.e. high memory pressure workloads. Of course, I may be totally wrong here. One thing that I discussed with Merlin offline and am now concerned about is how the actual eviction will work. We cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the oldest one. Any thoughts there would be appreciated. Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Apr 18, 2014, at 1:51 PM, Atri Sharma atri.j...@gmail.com wrote: Counting clock sweeps is an intersting idea. I think one concern was tracking hot buffers in cases where there is no memory pressure, and hence the clock sweep isn't running --- I am not sure how this would help in that case. Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype. The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system. What’s wrong with using a transaction number or some similar sequence? It would accurately track “age” in the sense we care about: how long ago in “units of real work being done by the DB” something was added. —Jason
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype. The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system. What’s wrong with using a transaction number or some similar sequence? It would accurately track “age” in the sense we care about: how long ago in “units of real work being done by the DB” something was added. Well, AIUI, we only need the 'relative' age of buffers in relation to the youngest buffer present. So the guy who has seen the maximum number of clocksweeps is the guy who has been around the longest. I do not see a need for an accurate estimate of the time spent in the buffer for any purpose right now. It may be useful in the future, though. How do you get the transaction ID? By accessing a tuple on the page and reading its XMIN? Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Fri, Apr 18, 2014 at 1:11 PM, Jason Petersen ja...@citusdata.com wrote: Yes, we obviously want a virtual clock. Focusing on the use of gettimeofday seems silly to me: it was something quick for the prototype. The gettimeofday() call doesn't need to happen in a tight loop. It can be reasonably passed down from higher up, very probably without consequence. The LRU-K paper actually recommends a delay of 5 seconds. There is at least one other major database system that unambiguously uses a wall-clock delay of 3 seconds (by default) for this exact purpose - avoiding what the LRU-K paper calls correlated references. I'm not saying that we should go with that particular scheme for these reasons. However, it's plainly untrue that using wall time like this represents some kind of insurmountable scalability obstacle. The problem with the clocksweeps is they don’t actually track the progression of “time” within the PostgreSQL system. What’s wrong with using a transaction number or some similar sequence? It would accurately track “age” in the sense we care about: how long ago in “units of real work being done by the DB” something was added. The LRU-K paper suggests that a scheme like this could be preferable. I have my doubts that it can be made to work with Postgres. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Sat, Apr 19, 2014 at 01:21:29AM +0530, Atri Sharma wrote: I feel that if there is no memory pressure, frankly it doesn't matter much what gets out and what not. The case I am specifically targeting is when the clocksweep gets to move about a lot, i.e. high memory pressure workloads. Of course, I may be totally wrong here. One thing that I discussed with Merlin offline and am now concerned about is how the actual eviction will work. We cannot traverse the entire list and then find all the buffers with refcount 0 and then do another traversal to find the oldest one. I thought if there was memory pressure the clock sweep would run and we wouldn't have everything at the max counter access value. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas robertmh...@gmail.com wrote: This isn't a fundamental property of the usage-count idea; it's an artifact of the fact that usage count decreases are tied to eviction pressure rather than access pressure. For example, suppose we made a rule that if the total usage counts of all buffers exceed 3 * NBuffers, then every time you bump the usage count of a buffer from N to N+1, you're required to advance the clock sweep far enough to decrease the reference count of a buffer by one. This sounds like the right way to reason about it. From what I remember in school, the idea with the clock sweep is to set the usage flags to the maximum whenever the buffer is used and decrement (actually, IIRC, typically shift right) them when the clock sweep goes by. I.e., simulate an LRU where the buffer jumps to the head of the list when it is accessed and moves gradually down the list when the clock comes by. What you're pointing out is that the clock might not come by very often, resulting in everything being at the head of the list. In that case I'm not clear it really matters what gets evicted though. And the CPU effort of running the clock n times sounds bad, but doing the work earlier doesn't really change the amount of work being done; it just amortizes it over more calls. But if you want to do that, it seems to me the way to do it is: every time a buffer is pinned, set it to the maximum and then run the clock (max_value - previous_value) steps. So the total usage counts of all buffers remains constant. If that results in contention, one way to reduce it is to do this probabilistically: run the clock 1% of the time but run it 100x as much as you would normally. But I think you've misidentified the problem and what those other algorithms are trying to solve. The problem is not that Postgres will pick a bad buffer to evict. 
If all the buffers have been used since the last time the clock came around then they're all hot anyway and it doesn't really matter which one we evict. The problem is that we expend an inordinate amount of work finding the few non-hot buffers. When you have a really large amount of memory and 99.9% of it is hot but 0.1% is whatever random non-hot page was needed last, then there's an obvious buffer to evict when you need a new one. But we spend a lot of work decrementing every hot buffer's usage count 4 times, only to have them immediately incremented again, just to find the 1 buffer where the usage count was 4 or 3. The goal of these algorithms that divide the buffers into groups is to avoid having to do so much work to find the colder buffers. Once the hot buffers migrate to the hot pool, we only need to run the clock there when we find we have new hot pages that we want to promote. All the thrashing in the cold pool can be more efficient because there are many fewer pages to consider. -- greg
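Greg's "pay as you pin" variant can be sketched in a toy model. This is a hypothetical illustration of the scheme he describes - set the pinned buffer to the maximum and immediately charge the clock with (max_value - previous_value) decrements so the total usage count stays constant - and not PostgreSQL's actual freelist.c logic:

```python
# Toy model: access-driven decrements keep the sum of usage counts constant
# instead of letting everything saturate between sweeps.

MAX_USAGE = 5

class Clock:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.hand = 0

    def pin(self, buf_id):
        debt = MAX_USAGE - self.usage[buf_id]
        self.usage[buf_id] = MAX_USAGE
        # Pay the debt by decrementing other buffers as the hand advances.
        # Stop after a full fruitless revolution so an otherwise-cold pool
        # can't loop forever.
        scanned_without_progress = 0
        n = len(self.usage)
        while debt > 0 and scanned_without_progress < n:
            if self.hand != buf_id and self.usage[self.hand] > 0:
                self.usage[self.hand] -= 1
                debt -= 1
                scanned_without_progress = 0
            else:
                scanned_without_progress += 1
            self.hand = (self.hand + 1) % n
```

The probabilistic refinement he mentions would simply skip the debt-paying loop 99% of the time and run 100x the decrements on the remaining 1% of pins, trading lock traffic for burstiness.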
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 9:40 AM, Greg Stark st...@mit.edu wrote: On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas robertmh...@gmail.com wrote: This isn't a fundamental property of the usage-count idea; it's an artifact of the fact that usage count decreases are tied to eviction pressure rather than access pressure. For example, suppose we made a rule that if the total usage counts of all buffers exceed 3 * NBuffers, then every time you bump the usage count of a buffer from N to N+1, you're required to advance the clock sweep far enough to decrease the reference count of a buffer by one. This sounds like the right way to reason about it. From what I remember in school, the idea with the clock sweep is to set the usage flags to the maximum whenever the buffer is used and decrement (actually, IIRC, typically shift right) them when the clock sweep goes by. I.e., simulate an LRU where the buffer jumps to the head of the list when it is accessed and moves gradually down the list when the clock comes by. What you're pointing out is that the clock might not come by very often, resulting in everything being at the head of the list. In that case I'm not clear it really matters what gets evicted though. And the CPU effort of running the clock n times sounds bad, but doing the work earlier doesn't really change the amount of work being done; it just amortizes it over more calls. But if you want to do that, it seems to me the way to do it is: every time a buffer is pinned, set it to the maximum and then run the clock (max_value - previous_value) steps. So the total usage counts of all buffers remains constant. If that results in contention, one way to reduce it is to do this probabilistically: run the clock 1% of the time but run it 100x as much as you would normally. But I think you've misidentified the problem and what those other algorithms are trying to solve. The problem is not that Postgres will pick a bad buffer to evict. 
If all the buffers have been used since the last time the clock came around then they're all hot anyway and it doesn't really matter which one we evict. The problem is that we expend an inordinate amount of work finding the few non-hot buffers. When you have a really large amount of memory and 99.9% of it is hot but 0.1% is whatever random non-hot page was needed last, then there's an obvious buffer to evict when you need a new one. But we spend a lot of work decrementing every hot buffer's usage count 4 times, only to have them immediately incremented again, just to find the 1 buffer where the usage count was 4 or 3. The goal of these algorithms that divide the buffers into groups is to avoid having to do so much work to find the colder buffers. Once the hot buffers migrate to the hot pool, we only need to run the clock there when we find we have new hot pages that we want to promote. All the thrashing in the cold pool can be more efficient because there are many fewer pages to consider. Well, I think Peter has proved that PostgreSQL *will* pick a bad buffer to evict. The proof is that when he changed the choice of buffer to evict, he got a significant performance improvement. I also believe this to be the case on first principles and my own experiments. Suppose you have a workload that fits inside shared_buffers. All of the usage counts will converge to 5. Then somebody accesses a table that is not cached, so something's got to be evicted. Because all the usage counts are the same, the eviction at this point is completely indiscriminate. We're just as likely to kick out a btree root page or a visibility map page as we are to kick out a random heap page, even though the former have probably been accessed several orders of magnitude more often. That's clearly bad. On systems that are not too heavily loaded it doesn't matter too much because we just fault the page right back in from the OS pagecache. 
But I've done pgbench runs where such decisions lead to long stalls, because the page has to be brought back in from disk, and there's a long I/O queue; or maybe just because the kernel thinks PostgreSQL is issuing too many I/O requests and makes some of them wait to cool things down. Of course, the overhead of repeated clock sweeps to push down the usage counts isn't a great thing either. I'm not saying that isn't a problem. But I think bad decisions about what to evict are also a problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
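To make the saturation failure mode above concrete, here is a toy Python model of a clock-sweep buffer pool. This is only a sketch, not PostgreSQL's actual implementation in freelist.c; the class and names are invented. With a workload that fits in the pool, every usage count climbs to the maximum, and the first eviction decision is then indiscriminate:

```python
# Toy clock-sweep buffer pool (illustrative only, not PostgreSQL code).
MAX_USAGE = 5

class ClockPool:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.page_of = [None] * nbuffers   # which page each buffer holds
        self.hand = 0

    def access(self, page):
        """Pin a page; fault it in (evicting a victim) on a miss."""
        if page in self.page_of:
            buf = self.page_of.index(page)
            self.usage[buf] = min(MAX_USAGE, self.usage[buf] + 1)
            return False                   # cache hit
        victim = self.evict()
        self.page_of[victim] = page
        self.usage[victim] = 1
        return True                        # miss

    def evict(self):
        """Advance the hand, decrementing usage counts, until a buffer
        with usage 0 is found; that buffer is the victim."""
        while self.usage[self.hand] > 0:
            self.usage[self.hand] -= 1
            self.hand = (self.hand + 1) % len(self.usage)
        victim = self.hand
        self.hand = (self.hand + 1) % len(self.usage)
        return victim

pool = ClockPool(8)
# A workload that fits in shared_buffers: every usage count saturates...
for _ in range(MAX_USAGE):
    for page in range(8):
        pool.access(page)
assert all(u == MAX_USAGE for u in pool.usage)
# ...so when a cold page arrives, the victim is simply whichever buffer
# the hand reaches after decrementing everything, regardless of how hot
# that buffer's page actually was.
```

Note how the eviction for the cold page has to decrement all 8 counts 5 times each before finding a victim, which is exactly the O(NBuffers) sweep cost complained about above.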
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote: I also believe this to be the case on first principles and my own experiments. Suppose you have a workload that fits inside shared_buffers. All of the usage counts will converge to 5. Then, somebody accesses a table that is not cached, so something's got to be evicted. Because all the usage counts are the same, the eviction at this point is completely indiscriminate. We're just as likely to kick out a btree root page or a visibility map page as we are to kick out a random heap page, even though the former have probably been accessed several orders of magnitude more often. That's clearly bad. On systems that are not too heavily loaded it doesn't matter too much because we just fault the page right back in from the OS pagecache. But I've done pgbench runs where such decisions lead to long stalls, because the page has to be brought back in from disk, and there's a long I/O queue; or maybe just because the kernel thinks PostgreSQL is issuing too many I/O requests and makes some of them wait to cool things down. I understand now. If there is no memory pressure, every buffer gets the max usage count, and when a new buffer comes in, it isn't the max so it is swiftly removed until the clock sweep has time to decrement the old buffers. Decaying buffers when there is no memory pressure creates additional overhead and gets into timing issues of when to decay. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian br...@momjian.us wrote: On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote: I also believe this to be the case on first principles and my own experiments. Suppose you have a workload that fits inside shared_buffers. All of the usage counts will converge to 5. Then, somebody accesses a table that is not cached, so something's got to be evicted. Because all the usage counts are the same, the eviction at this point is completely indiscriminate. We're just as likely to kick out a btree root page or a visibility map page as we are to kick out a random heap page, even though the former have probably been accessed several orders of magnitude more often. That's clearly bad. On systems that are not too heavily loaded it doesn't matter too much because we just fault the page right back in from the OS pagecache. But I've done pgbench runs where such decisions lead to long stalls, because the page has to be brought back in from disk, and there's a long I/O queue; or maybe just because the kernel thinks PostgreSQL is issuing too many I/O requests and makes some of them wait to cool things down. I understand now. If there is no memory pressure, every buffer gets the max usage count, and when a new buffer comes in, it isn't the max so it is swiftly removed until the clock sweep has time to decrement the old buffers. Decaying buffers when there is no memory pressure creates additional overhead and gets into timing issues of when to decay. That can happen, but the real problem I was trying to get at is that when all the buffers get up to max usage count, they all appear equally important. But in reality they're not. So when we do start evicting those long-resident buffers, it's essentially random which one we kick out. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote: On Thu, Apr 17, 2014 at 10:32 AM, Bruce Momjian br...@momjian.us wrote: On Thu, Apr 17, 2014 at 10:18:43AM -0400, Robert Haas wrote: I also believe this to be the case on first principles and my own experiments. Suppose you have a workload that fits inside shared_buffers. All of the usage counts will converge to 5. Then, somebody accesses a table that is not cached, so something's got to be evicted. Because all the usage counts are the same, the eviction at this point is completely indiscriminate. We're just as likely to kick out a btree root page or a visibility map page as we are to kick out a random heap page, even though the former have probably been accessed several orders of magnitude more often. That's clearly bad. On systems that are not too heavily loaded it doesn't matter too much because we just fault the page right back in from the OS pagecache. But I've done pgbench runs where such decisions lead to long stalls, because the page has to be brought back in from disk, and there's a long I/O queue; or maybe just because the kernel thinks PostgreSQL is issuing too many I/O requests and makes some of them wait to cool things down. I understand now. If there is no memory pressure, every buffer gets the max usage count, and when a new buffer comes in, it isn't the max so it is swiftly removed until the clock sweep has time to decrement the old buffers. Decaying buffers when there is no memory pressure creates additional overhead and gets into timing issues of when to decay. That can happen, but the real problem I was trying to get at is that when all the buffers get up to max usage count, they all appear equally important. But in reality they're not. So when we do start evicting those long-resident buffers, it's essentially random which one we kick out. True. Ideally we would have some way to know that _all_ the buffers had reached the maximum and kick off a sweep to decrement them all. 
I am unclear how we would do that. One odd idea would be to have a global counter that is incremented every time a buffer goes from 4 to 5 (max) --- when the counter equals 50% of all buffers, do a clock sweep. Of course, then the counter becomes a bottleneck. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
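Bruce's counter idea can be sketched in a few lines. This is a hypothetical illustration only (no such patch exists; all names are invented): promotions to the maximum usage count are tallied in a shared counter, and once they reach 50% of the pool, one full decay sweep runs.

```python
# Sketch of a global promotion counter triggering a full decay sweep.
# Hypothetical illustration; not PostgreSQL code.
MAX_USAGE = 5
NBUFFERS = 8

usage = [0] * NBUFFERS
promotions = 0          # shared counter: bumped on each 4 -> 5 promotion

def full_sweep():
    """One pass over the whole pool, decrementing every nonzero count."""
    global promotions
    for i in range(NBUFFERS):
        if usage[i] > 0:
            usage[i] -= 1
    promotions = 0

def bump(buf):
    """Increment a buffer's usage count; if it reaches the max, count
    the promotion, and sweep once promotions hit 50% of the pool."""
    global promotions
    if usage[buf] < MAX_USAGE:
        usage[buf] += 1
        if usage[buf] == MAX_USAGE:
            promotions += 1
            if promotions >= NBUFFERS // 2:
                full_sweep()

# Saturate half the pool: the 4th promotion triggers a sweep, so the
# counts cannot all sit pinned at the maximum.
for buf in range(4):
    for _ in range(MAX_USAGE):
        bump(buf)
assert max(usage) < MAX_USAGE
```

The single `promotions` counter is exactly the bottleneck Bruce predicts: in a real multi-backend setting every promotion would contend on it.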
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 10:48 AM, Bruce Momjian br...@momjian.us wrote: I understand now. If there is no memory pressure, every buffer gets the max usage count, and when a new buffer comes in, it isn't the max so it is swiftly removed until the clock sweep has time to decrement the old buffers. Decaying buffers when there is no memory pressure creates additional overhead and gets into timing issues of when to decay. That can happen, but the real problem I was trying to get at is that when all the buffers get up to max usage count, they all appear equally important. But in reality they're not. So when we do start evicting those long-resident buffers, it's essentially random which one we kick out. True. Ideally we would have some way to know that _all_ the buffers had reached the maximum and kick off a sweep to decrement them all. I am unclear how we would do that. One odd idea would be to have a global counter that is incremented every time a buffer goes from 4 to 5 (max) --- when the counter equals 50% of all buffers, do a clock sweep. Of course, then the counter becomes a bottleneck. Yeah, I think that's the right general line of thinking. But it doesn't have to be as coarse-grained as "do a whole clock sweep". It can be, you know, for every buffer that gets incremented from 4 to 5, run the clock sweep far enough to decrement the usage count of some other buffer by one. That's similar to your idea but you can do it a bit at a time rather than having to make a complete pass over shared_buffers all at once. Your other point, that the counter can become the bottleneck, is quite right also and a major problem in this area. I don't know how to solve it right at the moment, but I'm hopeful that there may be a way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
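Robert's pay-as-you-go variant can also be sketched: every promotion to the maximum usage count immediately pays for one decrement somewhere else, so total usage pressure stays balanced without a full pass over shared_buffers. Again a hypothetical illustration with invented names, not a real patch:

```python
# Sketch of incremental decay: each promotion to the max usage count
# runs the clock just far enough to decrement one other buffer.
# Hypothetical illustration; not PostgreSQL code.
MAX_USAGE = 5

class BalancedClock:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.hand = 0

    def bump(self, buf):
        if self.usage[buf] >= MAX_USAGE:
            return
        self.usage[buf] += 1
        if self.usage[buf] == MAX_USAGE:
            self._decay_one(skip=buf)

    def _decay_one(self, skip):
        # Advance the hand until one usage count is actually decremented.
        n = len(self.usage)
        for _ in range(n):
            i = self.hand
            self.hand = (self.hand + 1) % n
            if i != skip and self.usage[i] > 0:
                self.usage[i] -= 1
                return

clock = BalancedClock(8)
for _ in range(MAX_USAGE):
    for buf in range(8):
        clock.bump(buf)
# Unlike the plain scheme, the counts cannot all sit at the maximum:
# every promotion was offset by a decay somewhere else.
assert not all(u == MAX_USAGE for u in clock.usage)
```

Each promotion performed exactly one decrement, so the total usage across the pool stays bounded (here, 40 increments minus 4 decays), which is the "total usage counts remain constant" property Greg described earlier in the thread.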
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 2014-04-17 10:48:15 -0400, Bruce Momjian wrote: On Thu, Apr 17, 2014 at 10:40:40AM -0400, Robert Haas wrote: That can happen, but the real problem I was trying to get at is that when all the buffers get up to max usage count, they all appear equally important. But in reality they're not. So when we do start evicting those long-resident buffers, it's essentially random which one we kick out. True. Ideally we would have some way to know that _all_ the buffers had reached the maximum and kick off a sweep to decrement them all. I am unclear how we would do that. One odd idea would be to have a global counter that is incremented every time a buffer goes from 4 to 5 (max) --- when the counter equals 50% of all buffers, do a clock sweep. Of course, then the counter becomes a bottleneck. I have my doubts that we'll make the current scheme, where buffer reclaim essentially is O(NBuffers), work much better. Especially as CPU cache effects make such large, high-frequency accesses really expensive. I think we need more drastic changes. I am *not* suggesting that we do that, but I believe it'd be possible to implement a full LRU and be faster than today in scenarios with nontrivial amounts of shared buffers. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 10:18 AM, Robert Haas robertmh...@gmail.com wrote: Because all the usage counts are the same, the eviction at this point is completely indiscriminate. We're just as likely to kick out a btree root page or a visibility map page as we are to kick out a random heap page, even though the former have probably been accessed several orders of magnitude more often. That's clearly bad. That's not clear at all. In that circumstance, regardless of what page you evict, you're incurring precisely one page fault i/o when the page is read back in. Incurring that i/o is bad but it's unavoidable and it's the same badness regardless of what page it's for. The only way to prefer one page over another is if one page won't be needed for long enough for the page to be useful for caching this new buffer (or mixture of buffers) for multiple accesses. If you can't do that then it doesn't matter which buffer you use, since it'll just be evicted to read back in the original page again. -- greg
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 15, 2014 at 7:30 PM, Peter Geoghegan p...@heroku.com wrote: Frankly, there doesn't need to be any research on this, because it's just common sense that probabilistically, leaf pages are much more useful than heap pages in servicing index scan queries if we assume a uniform distribution. If we don't assume that, then they're still more useful on average. I don't think common sense is compelling. I think you need to pin down exactly what it is about btree intermediate pages that the LRU isn't capturing and not just argue they're more useful. The LRU is already capturing which pages are more heavily used than others so you need to identify what it is that makes index pages *even more* useful than their frequency and recency of access indicates. Not just that they're more useful than an average page. So what I think is missing is that indexes are always accessed from the root down to the leaf. So the most recent page accessed will always be the leaf. And in whatever chain of pages was used to reach the last leaf page the least recently accessed will always be the root. But we'll need the root page again on the subsequent descent even if it's to reach the same leaf page we kept in ram in preference to it. Now it doesn't *always* make sense to keep an intermediate page over leaf pages. Imagine an index that we always do full traversals of. We'll always descend from the root down the left-most pages and then follow the right pointers across. All the other intermediate pages will be cold. If we do an occasional descent probing for other keys those leaf pages shouldn't be cached since they won't be needed again for the common full index traversals and the next occasional probe will probably be looking for different keys. But if we're often probing for the same keys the last thing we want to do is throw away one of the intermediate pages for those keys when we could throw away a leaf page. But that's what would happen in a strict LRU. 
It's almost like what we would really want to do is mark the pages as least recently used in the opposite order from the order they're actually accessed when descending. Or perhaps bump the usage count to max+1 when it's an intermediate page so that it takes one extra cycle of decrementing before it's considered old compared to a leaf page. -- greg
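Greg's max+1 idea for intermediate pages is easy to sketch: give internal (non-leaf) index pages one extra usage level, so the sweep must pass them one more time than an equally hot leaf page before they become eviction candidates. A hypothetical illustration with invented names; PostgreSQL does not distinguish page kinds this way:

```python
# Sketch: internal index pages get one extra usage level (max+1), so
# they survive one extra full decay cycle compared to leaf pages.
# Hypothetical illustration; not PostgreSQL code.
LEAF_MAX = 5
INTERNAL_MAX = LEAF_MAX + 1

def on_access(usage, is_internal):
    """Bump a page's usage count, capped at its kind-specific maximum."""
    cap = INTERNAL_MAX if is_internal else LEAF_MAX
    return min(cap, usage + 1)

def sweeps_until_evictable(usage):
    """Each pass of the clock hand decrements the count by one; the
    page is evictable once the count reaches zero."""
    return usage

leaf = internal = 0
for _ in range(10):                   # both pages equally hot
    leaf = on_access(leaf, is_internal=False)
    internal = on_access(internal, is_internal=True)

# The internal page needs one extra full decay cycle before eviction.
assert sweeps_until_evictable(internal) == sweeps_until_evictable(leaf) + 1
```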
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
* Robert Haas (robertmh...@gmail.com) wrote: several orders of magnitude more often. That's clearly bad. On systems that are not too heavily loaded it doesn't matter too much because we just fault the page right back in from the OS pagecache. Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... But I've done pgbench runs where such decisions lead to long stalls, because the page has to be brought back in from disk, and there's a long I/O queue; or maybe just because the kernel thinks PostgreSQL is issuing too many I/O requests and makes some of them wait to cool things down. Exactly this. Of course, the overhead of repeated clock sweeps to push down the usage counts isn't a great thing either. I'm not saying that isn't a problem. But I think bad decisions about what to evict are also a problem. Using a bit more CPU here and there, particularly if it's done in a background worker, or ideally multiple background workers (one for each buffer pool), would be much better than evicting a hot page that isn't in the kernel's buffer either, 'cause we've held on to it long enough that the kernel thinks it's cold. Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote: Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... This is a whole nother problem. It is worrisome that we could be benchmarking page replacement algorithms in Postgres and choose one that performs well only because it tends to evict pages that are in the OS cache. And then one day (hopefully not too far off) we'll fix the double buffering problem and end up with a strange choice of page replacement algorithm. It also means that every benchmark is super sensitive to how large a fraction of system memory Postgres is managing. A benchmark of a page replacement algorithm with 3GB shared buffers might perform well compared to others on a system with 8GB or 32GB total RAM but actually be choosing pages very poorly in normal terms and perform terribly on a system with 4GB total RAM. Ideally what I would like to see is instrumentation of Postgres's buffer pinning so we can generate various test loads and then just run the different algorithms on them and measure precisely how many page evictions each causes, how often it chooses pages that need to be read in soon after, and so on. We shouldn't have to run Postgres to get these counts at all, just run the algorithm as we read through a text file (or database table) listing the pages being accessed. -- greg
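The trace-driven evaluation Greg describes needs no running server: replay a list of page accesses through a replacement policy offline and count the misses. A minimal sketch, using a toy clock sweep as the policy under test (invented names, not PostgreSQL code):

```python
# Sketch of offline, trace-driven evaluation of a replacement policy:
# feed it a recorded sequence of page accesses and count the misses.
# Hypothetical illustration; not PostgreSQL code.
MAX_USAGE = 5

def replay(trace, nbuffers):
    """Return the number of misses (cold faults + evictions followed by
    re-reads) the clock-sweep policy incurs on the given access trace."""
    pages = [None] * nbuffers
    usage = [0] * nbuffers
    hand, misses = 0, 0
    for page in trace:
        if page in pages:
            b = pages.index(page)
            usage[b] = min(MAX_USAGE, usage[b] + 1)
            continue
        misses += 1
        while usage[hand] > 0:            # find a victim
            usage[hand] -= 1
            hand = (hand + 1) % nbuffers
        pages[hand], usage[hand] = page, 1
        hand = (hand + 1) % nbuffers
    return misses

# A hot working set of 3 pages plus one cold access, in a 4-buffer
# pool: only the initial faults and the cold page should miss.
trace = [1, 2, 3] * 10 + [99] + [1, 2, 3] * 10
assert replay(trace, nbuffers=4) == 4
```

The same `replay` harness could be pointed at competing policies (2Q, ARC, LRU-K, ...) on the same trace, giving the apples-to-apples eviction counts Greg is asking for without benchmarking a live server.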
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 9:21 AM, Stephen Frost sfr...@snowman.net wrote: * Robert Haas (robertmh...@gmail.com) wrote: several orders of magnitude more often. That's clearly bad. On systems that are not too heavily loaded it doesn't matter too much because we just fault the page right back in from the OS pagecache. Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... But I've done pgbench runs where such decisions lead to long stalls, because the page has to be brought back in from disk, and there's a long I/O queue; or maybe just because the kernel thinks PostgreSQL is issuing too many I/O requests and makes some of them wait to cool things down. Exactly this. Yes, I believe that's why this is so effective. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
* Greg Stark (st...@mit.edu) wrote: On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote: Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... This is a whole nother problem. It is worrisome that we could be benchmarking page replacement algorithms in Postgres and choose one that performs well only because it tends to evict pages that are in the OS cache. And then one day (hopefully not too far off) we'll fix the double buffering problem and end up with a strange choice of page replacement algorithm. That's certainly possible but I don't see the double buffering problem going away any time particularly soon and, even if it does, it's likely to either a) mean we're just using the kernel's cache (eg: something w/ mmap, etc), or b) involve so many other changes that this will end up getting changed anyway. In any case, while I think we should document any such cache management system we employ as having this risk, I don't think we should worry about it terribly much. It also means that every benchmark is super sensitive to how large a fraction of system memory Postgres is managing. A benchmark of a page replacement algorithm with 3GB shared buffers might perform well compared to others on a system with 8GB or 32GB total RAM but actually be choosing pages very poorly in normal terms and perform terribly on a system with 4GB total RAM. I'm not following you here: benchmarks are already sensitive to how much of the system's memory PG is managing (and how much ends up being *dedicated* to PG's cache and therefore unavailable for other work). 
Ideally what I would like to see is instrumentation of Postgres's buffer pinning so we can generate various test loads and then just run the different algorithms on them and measure precisely how many page evictions it's causing and how often it's choosing pages that need to be read in soon after, and so on. We shouldn't have to run Postgres to get these counts at all, just run the algorithm as we read through a text file (or database table) listing the pages being accessed. Go for it. I'd love to see that also. Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote: On 04/17/2014 09:38 PM, Stephen Frost wrote: * Greg Stark (st...@mit.edu) wrote: On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote: Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... This is a whole nother problem. It is worrisome that we could be benchmarking page replacement algorithms in Postgres and choose one that performs well only because it tends to evict pages that are in the OS cache. And then one day (hopefully not too far off) we'll fix the double buffering problem and end up with a strange choice of page replacement algorithm. That's certainly possible but I don't see the double buffering problem going away any time particularly soon and, even if it does, it's likely to either a) mean we're just using the kernel's cache (eg: something w/ mmap, etc), or b) involve so many other changes that this will end up getting changed anyway. In any case, while I think we should document any such cache management system we employ as having this risk, I don't think we should worry about it terribly much. Note that if we somehow come up with a page replacement algorithm that tends to evict pages that are in the OS cache, we have effectively solved the double buffering problem. When a page is cached in both caches, evicting it from one of them eliminates the double buffering. Granted, you might prefer to evict it from the OS cache instead, and such an algorithm could be bad in other ways. But if a page replacement algorithm happens to avoid double buffering, that's a genuine merit for that algorithm. I don't think it's a good idea to try to synchronize algorithms with the OSs. There's so much change about the caching logic in e.g. linux that it won't stay effective for very long. 
Greetings, Andres Freund
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
* Andres Freund (and...@2ndquadrant.com) wrote: Note that if we somehow come up with a page replacement algorithm that tends to evict pages that are in the OS cache, we have effectively solved the double buffering problem. When a page is cached in both caches, evicting it from one of them eliminates the double buffering. Granted, you might prefer to evict it from the OS cache instead, and such an algorithm could be bad in other ways. But if a page replacement algorithm happens to avoid double buffering, that's a genuine merit for that algorithm. I don't think it's a good idea to try to synchronize algorithms with the OSs. There's so much change about the caching logic in e.g. linux that it won't stay effective for very long. There's also more than one OS... Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 04/17/2014 09:38 PM, Stephen Frost wrote: * Greg Stark (st...@mit.edu) wrote: On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote: Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... This is a whole nother problem. It is worrisome that we could be benchmarking page replacement algorithms in Postgres and choose one that performs well only because it tends to evict pages that are in the OS cache. And then one day (hopefully not too far off) we'll fix the double buffering problem and end up with a strange choice of page replacement algorithm. That's certainly possible but I don't see the double buffering problem going away any time particularly soon and, even if it does, it's likely to either a) mean we're just using the kernel's cache (eg: something w/ mmap, etc), or b) involve so many other changes that this will end up getting changed anyway. In any case, while I think we should document any such cache management system we employ as having this risk, I don't think we should worry about it terribly much. Note that if we somehow come up with a page replacement algorithm that tends to evict pages that are in the OS cache, we have effectively solved the double buffering problem. When a page is cached in both caches, evicting it from one of them eliminates the double buffering. Granted, you might prefer to evict it from the OS cache instead, and such an algorithm could be bad in other ways. But if a page replacement algorithm happens to avoid double buffering, that's a genuine merit for that algorithm. - Heikki
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 1:48 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-04-17 21:44:47 +0300, Heikki Linnakangas wrote: On 04/17/2014 09:38 PM, Stephen Frost wrote: * Greg Stark (st...@mit.edu) wrote: On Thu, Apr 17, 2014 at 12:21 PM, Stephen Frost sfr...@snowman.net wrote: Ehhh. No. If it's a hot page that we've been holding in *our* cache long enough, the kernel will happily evict it as 'cold' from *its* cache, leading to... This is a whole nother problem. It is worrisome that we could be benchmarking page replacement algorithms in Postgres and choose one that performs well only because it tends to evict pages that are in the OS cache. And then one day (hopefully not too far off) we'll fix the double buffering problem and end up with a strange choice of page replacement algorithm. That's certainly possible but I don't see the double buffering problem going away any time particularly soon and, even if it does, it's likely to either a) mean we're just using the kernel's cache (eg: something w/ mmap, etc), or b) involve so many other changes that this will end up getting changed anyway. In any case, while I think we should document any such cache management system we employ as having this risk, I don't think we should worry about it terribly much. Note that if we somehow come up with a page replacement algorithm that tends to evict pages that are in the OS cache, we have effectively solved the double buffering problem. When a page is cached in both caches, evicting it from one of them eliminates the double buffering. Granted, you might prefer to evict it from the OS cache instead, and such an algorithm could be bad in other ways. But if a page replacement algorithm happens to avoid double buffering, that's a genuine merit for that algorithm. I don't think it's a good idea to try to synchronize algorithms with the OSs. There's so much change about the caching logic in e.g. 
linux that it won't stay effective for very long. No. But if you were very judicious, maybe you could hint to the o/s (posix_fadvise) that it doesn't need to cache pages that are likely to stay hot in our buffers. I doubt that's necessary though -- if the postgres caching algorithm improves such that there is a better tendency for hot pages to stay in s_b, eventually the O/S will deschedule the page for something else that needs it. In other words, otherwise preventable double buffering is really a measurement of bad eviction policy because it manifests in volatility of frequently accessed pages. merlin
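Merlin's hint can be sketched from userspace with `posix_fadvise`, which is a real POSIX call (exposed in Python as `os.posix_fadvise`). This is purely illustrative: PostgreSQL does not issue this hint, the block size constant merely mirrors its default page size, and the advice is non-binding and Unix-only, so the sketch no-ops elsewhere.

```python
# Illustrative only: tell the kernel that its cached copy of a block we
# already hold hot in our own buffer pool is not needed. PostgreSQL
# does not actually do this; BLCKSZ mirrors its default page size.
import os
import tempfile

BLCKSZ = 8192

def hint_dont_need(fd, block_num):
    """Advisory hint: the kernel may drop its page-cache copy of this
    block. No-op on platforms without posix_fadvise (e.g. Windows)."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, block_num * BLCKSZ, BLCKSZ,
                         os.POSIX_FADV_DONTNEED)

with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * BLCKSZ * 2)
    f.flush()
    hint_dont_need(f.fileno(), 0)   # block 0 is "hot" in our cache
    f.seek(0)
    # The hint affects caching only; the data is still fully readable.
    assert f.read(BLCKSZ) == b"x" * BLCKSZ
```

Because DONTNEED is only advice, the kernel is free to ignore it, which is part of why (as Peter notes below) experiments along these lines have not shown a clear win.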
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 11:53 AM, Merlin Moncure mmonc...@gmail.com wrote: No. But if you were very judicious, maybe you could hint to the o/s (posix_fadvise) that it doesn't need to cache pages that are likely to stay hot in our buffers. Mitsumasa KONDO wrote a patch like that. I don't think the results were that promising, but things change quickly. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
* Merlin Moncure (mmonc...@gmail.com) wrote: I doubt that's necessary though -- if the postgres caching algorithm improves such that there is a better tendency for hot pages to stay in s_b, Eventually the O/S will deschedule the page for something else that needs it. In other words, otherwise preventable double buffering is really a measurement of bad eviction policy because it manifests in volatility of frequency accessed pages. I wonder if it would help to actually tell the OS to read in buffers that we're *evicting*... On the general notion that if the OS already has them buffered then it's almost a no-op, and if it doesn't and it's actually a 'hot' buffer that we're gonna need again shortly, the OS will have it. In other words, try to make the OS more like a secondary cache to ours by encouraging it to cache things we're evicting. Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 2:00 PM, Stephen Frost sfr...@snowman.net wrote: * Merlin Moncure (mmonc...@gmail.com) wrote: I doubt that's necessary though -- if the postgres caching algorithm improves such that there is a better tendency for hot pages to stay in s_b, eventually the O/S will deschedule the page for something else that needs it. In other words, otherwise preventable double buffering is really a measurement of bad eviction policy because it manifests in volatility of frequently accessed pages. I wonder if it would help to actually tell the OS to read in buffers that we're *evicting*... On the general notion that if the OS already has them buffered then it's almost a no-op, and if it doesn't and it's actually a 'hot' buffer that we're gonna need again shortly, the OS will have it. In other words, try to make the OS more like a secondary cache to ours by encouraging it to cache things we're evicting. I don't think this would work unless we kept some kind of tracking information on the page itself, which seems not worth a write operation (maybe if the page is dirtied it could be snuck in there, though...). IOW, it would only make sense to do this if we knew that this page was likely to be read in again. This might be true in general on particular workloads but is probably a pretty flimsy assumption without supporting evidence; probably better to let the O/S deal with it. merlin
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Stephen Frost sfr...@snowman.net writes: I wonder if it would help to actually tell the OS to read in buffers that we're *evicting*... On the general notion that if the OS already has them buffered then it's almost a no-op, and if it doesn't and it's actually a 'hot' buffer that we're gonna need again shortly, the OS will have it. But if it's actually gone cold, you're just forcing unnecessary read I/O, not to mention possibly causing something slightly warmer to be lost from kernel cache. regards, tom lane
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
* Merlin Moncure (mmonc...@gmail.com) wrote: I don't think this would work unless we kept some kind of tracking information on the page itself, which seems not worth a write operation to do (maybe if the page is dirtied it could be snuck in there though...). IOW, it would only make sense to do this if we knew that this page was likely to be read in again. This might be true in general on particular workloads but is probably a pretty flimsy assumption without supporting evidence; probably better to let the O/S deal with it. The trouble is that we're ending up hiding the information from the OS about the frequency of utilization of that page. You have a good point and we wouldn't want to do this for pages that are just accessed once or similar, but perhaps just mark a page that's reached the 'max' as having been 'hot' and then, for those pages, advise the OS that while we're under pressure and need to push this page out, it was once pretty hotly used and therefore we may want it again soon. For pages that never reach the 'max' level, we wouldn't do anything, on the assumption that those were only temporarily needed. Just some thoughts. Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
* Tom Lane (t...@sss.pgh.pa.us) wrote: Stephen Frost sfr...@snowman.net writes: I wonder if it would help to actually tell the OS to read in buffers that we're *evicting*... On the general notion that if the OS already has them buffered then it's almost a no-op, and if it doesn't and it's actually a 'hot' buffer that we're gonna need again shortly, the OS will have it. But if it's actually gone cold, you're just forcing unnecessary read I/O, not to mention possibly causing something slightly warmer to be lost from kernel cache. Certainly possible - see the email I just sent about another thought around this. Obviously, none of these thoughts are really fully formed solutions and are, instead, just speculation and ideas. Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 2:16 PM, Stephen Frost sfr...@snowman.net wrote: * Merlin Moncure (mmonc...@gmail.com) wrote: I don't think this would work unless we kept some kind of tracking information on the page itself, which seems not worth a write operation to do (maybe if the page is dirtied it could be snuck in there though...). IOW, it would only make sense to do this if we knew that this page was likely to be read in again. This might be true in general on particular workloads but is probably a pretty flimsy assumption without supporting evidence; probably better to let the O/S deal with it. The trouble is that we're ending up hiding the information from the OS about the frequency of utilization of that page. You have a good point and we wouldn't want to do this for pages that are just accessed once or similar, but perhaps just mark a page that's reached the 'max' as having been 'hot' and then, for those pages, advise the OS that while we're under pressure and need to push this page out, it was once pretty hotly used and therefore we may want it again soon. For pages that never reach the 'max' level, we wouldn't do anything, on the assumption that those were only temporarily needed. yeah -- the thing is, we are already too spendy on supplemental write i/o (hint bits, visible bits, freezing, etc) and it's likely not worth it to throw something else on the pile unless the page is already dirty; the medium term trend in storage is that read vs write performance is becoming increasingly asymmetric, particularly on the random side, so it's very unlikely to balance out. merlin
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thursday, April 17, 2014, Merlin Moncure mmonc...@gmail.com wrote: yeah -- the thing is, we are already too spendy on supplemental write i/o (hint bits, visible bits, freezing, etc) and it's likely not worth it to throw something else on the pile unless the page is already dirty; the medium term trend in storage is that read vs write performance is becoming increasingly asymmetric, particularly on the random side, so it's very unlikely to balance out. Guess I wasn't clear, but I was thinking to read the page in, not do any writing, and do it in an asynchronous way relative to the process doing the evicting. Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 2:28 PM, Stephen Frost sfr...@snowman.net wrote: On Thursday, April 17, 2014, Merlin Moncure mmonc...@gmail.com wrote: yeah -- the thing is, we are already too spendy on supplemental write i/o (hint bits, visible bits, freezing, etc) and it's likely not worth it to throw something else on the pile unless the page is already dirty; the medium term trend in storage is that read vs write performance is becoming increasingly asymmetric, particularly on the random side, so it's very unlikely to balance out. Guess I wasn't clear, but I was thinking to read the page in, not do any writing, and do it in an asynchronous way relative to the process doing the evicting. no -- I got you. My point was, that's a pure guess unless you base it on evidence recorded on the page itself. Without that evidence (which requires writing), the operating system is in a better place to make that guess, so it's probably better to defer that decision. merlin
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thursday, April 17, 2014, Merlin Moncure mmonc...@gmail.com wrote: no -- I got you. My point was, that's a pure guess unless you base it on evidence recorded on the page itself. Without that evidence (which requires writing), the operating system is in a better place to make that guess, so it's probably better to defer that decision. Well, we'd only need that info to be stored in the buffer cache somehow - it wouldn't have to go to disk or cause more I/O, of course. My thinking was that we could track it with the existing counter too, avoiding even that small amount of locking to write to the buffer page. Thanks, Stephen
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 8:10 AM, Greg Stark st...@mit.edu wrote: I don't think common sense is compelling. I think you need to pin down exactly what it is about btree intermediate pages that the LRU isn't capturing and not just argue they're more useful. The LRU is already capturing which pages are more heavily used than others so you need to identify what it is that makes index pages *even more* useful than their frequency and recency of access indicates. Not just that they're more useful than an average page. See example 1.1 within the LRU-K paper. So what I think is missing is that indexes are always accessed from the root down to the leaf. So the most recent page accessed will always be the leaf. And in whatever chain of pages was used to reach the last leaf page the least recently accessed will always be the root. But we'll need the root page again on the subsequent descent even if it's to reach the same leaf page we kept in ram in preference to it. I can't imagine that this is much of a problem in practice. 
Consider the break-down of pages within indexes when pgbench scale is 5,000, as in my original benchmark:

[local] pg@pgbench=# with tots as (
    SELECT count(*) c, type, relname
    from (select relname, relpages,
                 generate_series(1, relpages - 1) i
          from pg_class c
          join pg_namespace n on c.relnamespace = n.oid
          where relkind = 'i' and nspname = 'public') r,
         lateral (select * from bt_page_stats(relname, i)) u
    group by relname, type)
select tots.relname, relpages - 1 as non_meta_pages, c,
       c / sum(c) over (partition by tots.relname) as prop_of_index, type
from tots
join pg_class c on c.relname = tots.relname
order by 2 desc, 1, type;

        relname        | non_meta_pages |    c    |     prop_of_index      | type
-----------------------+----------------+---------+------------------------+------
 pgbench_accounts_pkey |        1370950 |    4828 | 0.00352164557423684307 | i
 pgbench_accounts_pkey |        1370950 | 1366121 | 0.99647762500455888253 | l
 pgbench_accounts_pkey |        1370950 |       1 | 0.00000072942120427441 | r
 pgbench_tellers_pkey  |            274 |     273 | 0.99635036496350364964 | l
 pgbench_tellers_pkey  |            274 |       1 | 0.00364963503649635036 | r
 pgbench_branches_pkey |             28 |      27 | 0.96428571428571428571 | l
 pgbench_branches_pkey |             28 |       1 | 0.03571428571428571429 | r
(7 rows)

Time: 14562.297 ms

Just over 99.6% of pages (leaving aside the meta page) in the big 10 GB pgbench_accounts_pkey index are leaf pages. The inner pages and root page are at an enormous advantage. In this example, the other indexes don't even have what would be separately classified as an inner page (and not a root page) at all, because it's perfectly sufficient to only have a root page to get to any one of, say, 273 leaf pages (in the case of pgbench_tellers_pkey here). -- Peter Geoghegan
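As a rough sanity check of those proportions (my own arithmetic, not from the thread): with an average fanout of f leaf pages per internal page, a B-Tree needs roughly one internal page per f leaf pages, so the leaf fraction approaches f/(f+1), and the reported 99.65% leaf fraction implies a fanout near 283:

```python
# Figures taken from the bt_page_stats output above for pgbench_accounts_pkey.
leaf_pages = 1366121
non_meta_pages = 1370950

leaf_fraction = leaf_pages / non_meta_pages
inner_and_root = non_meta_pages - leaf_pages   # 4829 = 4828 inner + 1 root
implied_fanout = leaf_pages / inner_and_root   # leaves per internal page

print(f"leaf fraction = {leaf_fraction:.4%}")  # ~99.65%
print(f"implied fanout ~ {implied_fanout:.0f}")
```

A fanout near 283 is plausible for an int4 key with typical page fill; wide text or numeric keys shrink the fanout, which is exactly why their leaf fraction drops toward the 96% worst case mentioned below.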
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote: Just over 99.6% of pages (leaving aside the meta page) in the big 10 GB pgbench_accounts_pkey index are leaf pages. That's a rather nice number. I knew it was big, but I'd have guessed it'd be a percent lower. Do you happen to have the same stat handy for a sensibly wide text or numeric real world index? It'd be interesting to see what the worst case there is. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 1:39 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-04-17 13:33:27 -0700, Peter Geoghegan wrote: Just over 99.6% of pages (leaving aside the meta page) in the big 10 GB pgbench_accounts_pkey index are leaf pages. That's a rather nice number. I knew it was big, but I'd have guessed it'd be a percent lower. Yes, it's usually past 99.5% for int4. It's really bad if it's as low as 96%, and I think that often points to what are arguably bad indexing choices, like indexing text columns that have long text strings. Do you happen to have the same stat handy for a sensibly wide text or numeric real world index? It'd be interesting to see what the worst case there is. Yes, as it happens I do: http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=a...@mail.gmail.com I was working off my Mouse Genome database, which is actually real-world data used by medical researchers, stored in a PostgreSQL database by those researchers and made available for the benefit of other medical researchers. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 1:33 PM, Peter Geoghegan p...@heroku.com wrote: I can't imagine that this is much of a problem in practice. Although I will add that not caching highly useful inner pages for the medium term, because that index isn't being used at all for 5 minutes, is probably very bad. Using the 4,828 buffers that it would take to store all the inner pages (as in my large primary index example) to go store something else is probably penny wise and pound foolish. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan p...@heroku.com wrote: Although I will add that not caching highly useful inner pages for the medium term, because that index isn't being used at all for 5 minutes probably is very bad. Using the 4,828 buffers that it would take to store all the inner pages (as in my large primary index example) to go store something else is probably penny wise and pound foolish. But there could easily be 20 unused indexes for every 1 index that is being used.
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Thu, Apr 17, 2014 at 6:50 PM, Greg Stark st...@mit.edu wrote: On Thu, Apr 17, 2014 at 4:48 PM, Peter Geoghegan p...@heroku.com wrote: Although I will add that not caching highly useful inner pages for the medium term, because that index isn't being used at all for 5 minutes probably is very bad. Using the 4,828 buffers that it would take to store all the inner pages (as in my large primary index example) to go store something else is probably penny wise and pound foolish. But there could easily be 20 unused indexes for every 1 index that is being used. Sure, but then there might not be. Obviously there is a trade-off to be made between recency and frequency. One interesting observation in the LRU-K paper is that for their test case, a pure LFU actually works very well, despite, as the authors acknowledge, being a terrible algorithm in the real world. That's because their test case is so simple, and concerns only one table/index, with a uniform distribution. -- Peter Geoghegan
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 15, 2014 at 9:44 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan p...@heroku.com wrote: In the past, various hackers have noted problems they've observed with this scheme. A common pathology is to see frantic searching for a victim buffer only to find all buffer usage_count values at 5. It may take multiple revolutions of the clock hand before a victim buffer is found, as usage_count is decremented for each and every buffer. Also, BufFreelistLock contention is considered a serious bottleneck [1], which is related. I think that the basic problem here is that usage counts increase when buffers are referenced, but they decrease when buffers are evicted, and those two things are not in any necessary way connected to each other. In particular, if no eviction is happening, reference counts will converge to the maximum value. I've read a few papers about algorithms that attempt to segregate the list of buffers into hot and cold lists, and an important property of such algorithms is that they mustn't be allowed to make everything hot. It's possible that I've misunderstood what you mean here, but do you really think it's likely that everything will be hot, in the event of using something like what I've sketched here? I think it's an important measure against this general problem that buffers really earn the right to be considered hot, so to speak. With my prototype, in order for a buffer to become as hard to evict as possible, at a minimum it must be *continually* pinned for at least 30 seconds. That's actually a pretty tall order. Although, as I said, I wouldn't be surprised if it was worth making it possible for buffers to be even more difficult to evict than that. It should be extremely difficult to evict a root B-Tree page, and to a lesser extent inner pages even under a lot of cache pressure, for example. 
There are lots of workloads in which that can happen, and I have a hard time believing that it's worth it to evict them given the extraordinary difference in their utility as compared to a lot of other things. I can imagine a huge barrier against evicting what is actually a relatively tiny number of pages being worth it. I don't want to dismiss what you're saying about heating and cooling being unrelated, but I don't find the conclusion that not everything can be hot obvious. Maybe heat should be relative rather than absolute, and maybe that's actually what you meant. There is surely some workload where buffer access actually is perfectly uniform, and what do you do there? What temperature are those buffers? It occurs to me that within the prototype patch, even though usage_count is incremented in a vastly slower fashion (in a wall time sense), clock sweep doesn't take advantage of that. I should probably investigate having clock sweep become more aggressive in decrementing in response to realizing that it won't get some buffer's usage_count down to zero on the next revolution either. There are certainly problems with that, but they might be fixable. Within the patch, in order for it to be possible for the usage_count to be incremented in the interim, an average of 1.5 seconds must pass, so if clock sweep were to anticipate another no-set-to-zero revolution, it seems pretty likely that it would be exactly right, or if not then close enough, since it can only really fail to correct for some buffers getting incremented once more in the interim. Conceptually, it would be like multiple logical revolutions were merged into one actual one, sufficient to have the next revolution find a victim buffer. -- Peter Geoghegan
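The merged-revolutions idea at the end of that message can be sketched as a toy model (my own illustration, ignoring buffer pins, locking, and the freelist; a buffer is reduced to just its usage_count):

```python
def clock_sweep(usage_counts, hand):
    # Textbook clock sweep: decrement counts as the hand passes, evict the
    # first buffer found at zero.  With every count at the maximum, this
    # takes several full revolutions before any victim appears.
    n = len(usage_counts)
    while True:
        if usage_counts[hand] == 0:
            return hand, (hand + 1) % n   # (victim, new hand position)
        usage_counts[hand] -= 1
        hand = (hand + 1) % n

def clock_sweep_merged(usage_counts, hand):
    # Sketch of the suggestion above: when no buffer can reach zero this
    # revolution, collapse the next min(counts) logical revolutions into
    # one by subtracting the minimum count up front.  For this simplified
    # model it selects the same victim the plain sweep would.
    m = min(usage_counts)
    if m > 0:
        usage_counts[:] = [c - m for c in usage_counts]
    return clock_sweep(usage_counts, hand)

counts = [5, 5, 7, 5]
victim, hand = clock_sweep_merged(counts, 0)
print(victim, counts)   # victim found without five literal revolutions
```

In the real buffer manager the complication is that other backends pin buffers and bump counts concurrently, which is exactly the "fail to correct for some buffers getting incremented once more in the interim" caveat above.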
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Hi, It's good to see focus on this - some improvements around s_b are sorely needed. On 2014-04-14 10:11:53 -0700, Peter Geoghegan wrote: 1) Throttles incrementation of usage_count temporally. It becomes impossible to increment usage_count for any given buffer more frequently than every 3 seconds, while decrementing usage_count is totally unaffected. I think this is unfortunately completely out of question. For one a gettimeofday() for every buffer pin will become a significant performance problem. Even the computation of the xact/stmt start/stop timestamps shows up pretty heavily in profiles today - and they are far less frequent than buffer pins. And that's on x86 linux, where gettimeofday() is implemented as something more lightweight than a full syscall. The other significant problem I see with this is that it's not adaptive to the actual throughput of buffers in s_b. In many cases there are hundreds of clock cycles through shared buffers in 3 seconds. By only increasing the usagecount that often you've destroyed the little semblance of a working LRU there is right now. It also wouldn't work well for situations with a fast changing workload s_b. If you have frequent queries that take a second or so and access some data repeatedly (index nodes or whatnot), only increasing the usagecount once will mean they'll continually fall back to disk access. 2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not 5 as before. ... This step on its own would be assumed extremely counter-productive by those in the know, but I believe that other measures ameliorate the downsides. I could be wrong about how true that is in other cases, but then the case helped here isn't what you'd call a narrow benchmark. I don't see which mechanisms you have suggested that counter this? I think having more granular usagecount is a good idea, but I don't think it can realistically be implemented with the current method of choosing victim buffers.
The amount of cacheline misses around that is already a major scalability limit; we surely can't make this even worse. I think it'd be possible to get back to this if we had a better bgwriter implementation. Greetings, Andres Freund
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund and...@2ndquadrant.com wrote: I think this is unfortunately completely out of question. For one a gettimeofday() for every buffer pin will become a significant performance problem. Even the computation of the xact/stmt start/stop timestamps shows up pretty heavily in profiles today - and they are far less frequent than buffer pins. And that's on x86 linux, where gettimeofday() is implemented as something more lightweight than a full syscall. Come on, Andres. Of course exactly what I've done here is completely out of the question as a patch that we can go and commit right now. I've noted numerous caveats about bloating the buffer descriptors, and about it being a proof of concept. I'm pretty sure we can come up with a scheme to significantly cut down on the number of gettimeofday() calls if it comes down to it. In any case, I'm interested in advancing our understanding of the problem right now. Let's leave the minutiae to one side for the time being. The other significant problem I see with this is that it's not adaptive to the actual throughput of buffers in s_b. In many cases there are hundreds of clock cycles through shared buffers in 3 seconds. By only increasing the usagecount that often you've destroyed the little semblance of a working LRU there is right now. If a usage_count can get to BM_MAX_USAGE_COUNT from its initial allocation within an instant, that's bad. It's that simple. Consider all the ways in which that can happen almost by accident. You could probably reasonably argue that the trade-off or lack of adaptation (between an LRU and an LFU) that this particular sketch of mine represents is inappropriate or sub-optimal, but I don't understand why you're criticizing the patch for doing what I expressly set out to do. I wrote that I think a very real problem may be that approximating an LRU is bad because an actual LRU is bad. It also wouldn't work well for situations with a fast changing workload s_b.
If you have frequent queries that take a second or so and access some data repeatedly (index nodes or whatnot), only increasing the usagecount once will mean they'll continually fall back to disk access. No, it shouldn't, because there is a notion of buffers getting a fair chance to prove themselves. Now, it might well be the case that there are workloads where what I've done to make that happen in this prototype doesn't work out too well - I've already said so. But should a buffer get a usage count of 5 just because the user inserted 5 tuples within a single DML command, for example? If so, why? -- Peter Geoghegan
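The behaviour being argued over, the prototype's once-every-3-seconds increment throttle, can be modelled in a few lines (a toy sketch in Python; the names mirror the discussion but not the actual patch, and the real code works on shared buffer descriptors in C):

```python
BM_MAX_USAGE_COUNT = 10      # the prototype's raised cap (was 5)
INCREMENT_INTERVAL = 3.0     # seconds between permitted increments

class BufferDesc:
    # Toy buffer descriptor: a pin bumps usage_count only if 3 s have
    # passed since the last bump, so however frantically a buffer is
    # pinned, reaching the cap takes ~30 s of sustained interest.
    # Decrements by the clock hand are deliberately unthrottled.
    def __init__(self):
        self.usage_count = 0
        self.last_bump = float("-inf")

    def pin(self, now):
        if (now - self.last_bump >= INCREMENT_INTERVAL
                and self.usage_count < BM_MAX_USAGE_COUNT):
            self.usage_count += 1
            self.last_bump = now

buf = BufferDesc()
for second in range(10):         # pin once per second for 10 s
    buf.pin(now=float(second))
print(buf.usage_count)           # 4: bumps permitted at t = 0, 3, 6, 9
```

This makes both sides of the dispute concrete: pinning a buffer 10,000 times in one burst leaves it no hotter than pinning it once, which is the point of the design, and also why Andres objects that a query repeatedly touching an index node for one second gains almost nothing.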
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 2014-04-16 01:58:23 -0700, Peter Geoghegan wrote: On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund and...@2ndquadrant.com wrote: I think this is unfortunately completely out of question. For one a gettimeofday() for every buffer pin will become a significant performance problem. Even the computation of the xact/stmt start/stop timestamps shows up pretty heavily in profiles today - and they are far less frequent than buffer pins. And that's on x86 linux, where gettimeofday() is implemented as something more lightweight than a full syscall. Come on, Andres. Of course exactly what I've done here is completely out of the question as a patch that we can go and commit right now. I've noted numerous caveats about bloating the buffer descriptors, and about it being a proof of concept. I'm pretty sure we can come up with a scheme to significantly cut down on the number of gettimeofday() calls if it comes down to it. In any case, I'm interested in advancing our understanding of the problem right now. Let's leave the minutiae to one side for the time being. *I* don't think any scheme that involves measuring the time around buffer pins is going to be acceptable. It's better that I say that now rather than when you've invested significant time into the approach, no? The other significant problem I see with this is that it's not adaptive to the actual throughput of buffers in s_b. In many cases there are hundreds of clock cycles through shared buffers in 3 seconds. By only increasing the usagecount that often you've destroyed the little semblance of a working LRU there is right now. If a usage_count can get to BM_MAX_USAGE_COUNT from its initial allocation within an instant, that's bad. It's that simple. Consider all the ways in which that can happen almost by accident. Yes, I agree that that's a problem. It immediately going down to zero is a problem as well though.
And that's what will happen in many scenarios, because you have time limits on increasing the usagecount, but not when decreasing. It also wouldn't work well for situations with a fast changing workload s_b. If you have frequent queries that take a second or so and access some data repeatedly (index nodes or whatnot), only increasing the usagecount once will mean they'll continually fall back to disk access. No, it shouldn't, because there is a notion of buffers getting a fair chance to prove themselves. If you have a workload with (BM_MAX_USAGE_COUNT + 1) clock cycles/second, how does *any* buffer have a chance to prove itself? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 2:18 AM, Andres Freund and...@2ndquadrant.com wrote: *I* don't think any scheme that involves measuring the time around buffer pins is going to be acceptable. It's better that I say that now rather than when you've invested significant time into the approach, no? Well, I do think that it will be possible to make something like that work. LRU-K/LRU-2 involves remembering the last two access times (not the last one). Researchers considered preeminent authorities on caching algorithms thought that was a good idea in 1993. There are plenty of other examples of similar schemes too. No, it shouldn't, because there is a notion of buffers getting a fair chance to prove themselves. If you have a workload with (BM_MAX_USAGE_COUNT + 1) clock cycles/second, how does *any* buffer have a chance to prove itself? There could be lots of ways. I thought about representing that more directly. I don't think that it's useful to have a large number of revolutions in search of a victim under any circumstances. Fundamentally, you're asking "what if any scheme here leans too heavily towards frequency?". That could certainly be a problem, as I've said, and we could think about adaptation over heuristics, as I've said, but it is very obviously a big problem that clock sweep doesn't really care about frequency one bit right now. Why should I be the one with all the answers? Aren't you interested in the significance of the patch, and the test case? -- Peter Geoghegan
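The LRU-2 policy referenced here can be sketched compactly (a toy illustration of the idea from the LRU-K paper, not production code and not what any patch in this thread implements):

```python
from math import inf

class LRU2:
    # Toy LRU-2: evict the page whose *second* most recent access is
    # oldest.  A page touched only once has a penultimate access time of
    # -inf, so it loses to any page with two accesses on record, e.g. an
    # index root touched on every descent.  This is the defense against
    # correlated references that a plain LRU lacks.
    def __init__(self, capacity):
        self.capacity = capacity
        self.hist = {}              # page -> (last, second_to_last)
        self.clock = 0              # logical time, not gettimeofday()

    def access(self, page):
        self.clock += 1
        last, _ = self.hist.get(page, (-inf, -inf))
        if page not in self.hist and len(self.hist) >= self.capacity:
            victim = min(self.hist, key=lambda p: self.hist[p][1])
            del self.hist[victim]
        self.hist[page] = (self.clock, last)

cache = LRU2(capacity=2)
for page in ["root", "leaf1", "root", "leaf2"]:
    cache.access(page)
print(sorted(cache.hist))   # the twice-touched root survives; leaf1 goes
```

Using a logical clock here also hints at one way around Andres's gettimeofday() objection: the access-time bookkeeping does not strictly require wall-clock reads on every pin.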
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
Hi, On 2014-04-16 02:57:54 -0700, Peter Geoghegan wrote: Why should I be the one with all the answers? Who said you need to be? The only thing I am saying is that I don't agree with some of your suggestions. I only responded to the thread now because downthread (in CAM3SWZQa2OAVUrfPL-df=we1smozkbr392sw_novukzepxh...@mail.gmail.com) you further argued using the timestamp - which I think is a flawed concept. So I thought it'd be fair to argue against it now, rather than later. Aren't you interested in the significance of the patch, and the test case? Not particularly in the specifics, to be honest. The tradeoffs of the techniques you used in there seem prohibitive to me. It's easy to make individual cases faster by sacrificing others. Sometimes it's useful to prototype solutions while narrowing the scope for evaluation to get faster feedback, but as I don't see the solutions as applicable in the general case... I think it's very important to improve upon the current state. It's imo one of postgres' biggest issues. But it's also far from trivial, otherwise it'd be done already. I *personally* don't think it's very likely that we can improve significantly upon the current state as long as every process regularly participates in the clock sweep. ISTM that prevents many more elaborate techniques from being used (cache misses/bus traffic, locking). But that's just gut feeling. I also think there are bigger issues than the actual LRU/whatever behaviour, namely the scalability issues around shared buffers making both small and big s_b settings major bottlenecks. But that's just where I have seen more issues personally. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 3:22 AM, Peter Geoghegan p...@heroku.com wrote: It's possible that I've misunderstood what you mean here, but do you really think it's likely that everything will be hot, in the event of using something like what I've sketched here? I think it's an important measure against this general problem that buffers really earn the right to be considered hot, so to speak. With my prototype, in order for a buffer to become as hard to evict as possible, at a minimum it must be *continually* pinned for at least 30 seconds. That's actually a pretty tall order. Although, as I said, I wouldn't be surprised if it was worth making it possible for buffers to be even more difficult to evict than that. It should be extremely difficult to evict a root B-Tree page, and to a lesser extent inner pages even under a lot of cache pressure, for example. There are lots of workloads in which that can happen, and I have a hard time believing that it's worth it to evict given the extraordinary difference in their utility as compared to a lot of other things. I can imagine a huge barrier against evicting what is actually a relatively tiny number of pages being worth it. I'm making a general statement about a property that I think a buffer eviction algorithm ought to have. I actually didn't say anything about the algorithm you've chosen one way or the other. Obviously, you've built in some protections against everything becoming hot, and that's a good thing as far as it goes. But you also have a greatly increased risk of everything becoming cold. All you need is a rate of buffer eviction that circles shared_buffers more often than once every 3 seconds, and everything will gradually cool down until you once again can't distinguish which stuff is hot from which stuff isn't. I don't want to dismiss what you're saying about heating and cooling being unrelated, but I don't find the conclusion that not everything can be hot obvious. 
Maybe heat should be relative rather than absolute, and maybe that's actually what you meant. There is surely some workload where buffer access actually is perfectly uniform, and what do you do there? What temperature are those buffers? Obviously, some value lower than the maximum and higher than the minimum. If they're all at max temperature and then a new buffer (a btree root or vm page, for example) comes along and is much hotter, there's no room on the scale left to express that. If they're all at min temperature and then a new buffer comes along that is just used once and thrown out, there's no room left on the scale for that buffer to emerge as a good candidate for eviction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
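The saturation argument above is easy to demonstrate in miniature. The sketch below (hypothetical names, not taken from the actual PostgreSQL sources) models the essential shape of the current scheme: pins bump a saturating usage count, and the count only ever comes back down when the clock hand passes during an eviction attempt, so in the absence of eviction pressure every buffer drifts to the maximum and becomes indistinguishable:

```c
#include <assert.h>
#include <stddef.h>

#define NBUF      8
#define MAX_USAGE 5

typedef struct { int usage_count; } Buf;

static Buf pool[NBUF];   /* all counts start at zero */

/* Reference a buffer: bump usage_count, saturating at MAX_USAGE. */
static void pin_buffer(int i)
{
    if (pool[i].usage_count < MAX_USAGE)
        pool[i].usage_count++;
}

/* One rotation of the clock hand: decrement counts as we pass, and
 * return the index of the first buffer found with usage_count == 0
 * (the eviction victim), or -1 if none was found on this pass. */
static int clock_sweep_pass(void)
{
    for (int i = 0; i < NBUF; i++)
    {
        if (pool[i].usage_count == 0)
            return i;
        pool[i].usage_count--;
    }
    return -1;
}
```

With any steady reference stream and no sweeps, every count pins to MAX_USAGE; only running the sweep (i.e. eviction pressure) recreates the distinctions, which is exactly the "increments and decrements are not connected" complaint.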
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 7:44 AM, Robert Haas robertmh...@gmail.com wrote:

I think that the basic problem here is that usage counts increase when buffers are referenced, but they decrease when buffers are evicted, and those two things are not in any necessary way connected to each other. In particular, if no eviction is happening, reference counts will converge to the maximum value. I've read a few papers about algorithms that attempt to segregate the list of buffers into hot and cold lists, and an important property of such algorithms is that they mustn't be allowed to make everything hot. It's easy to be too simplistic, here: an algorithm that requires that no more than half the list be hot will fall over badly on a workload where the working set exceeds the available cache and the really hot portion of the working set is 60% of the available cache. So you need a more sophisticated algorithm than that. But that core property that not all buffers can be hot must somehow be preserved, and our algorithm doesn't.

FWIW, in CLOCK-Pro segregating buffers between hot and cold is tied to eviction and the clock sweep; the ratio between hot and cold is dynamically adapted based on prior experience. The main downside is that it seems to require an indirection somewhere, either in the clock sweep or in buffer lookup. Maybe it's possible to avoid that with some clever engineering if we think hard enough. CLOCK-Pro may also have too little memory of hotness, making it too easy to blow the whole cache away with a burst of activity. It may be useful to have a (possibly tunable) notion of fairness, where one query/backend can't take over the cache even though it may be an overall win in terms of total number of I/Os performed. Maybe we need to invent Generalized CLOCK-Pro with a larger number of levels, ranging from cold, hot and scalding to infernal.
:)

Regards,
Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
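The point that CLOCK-Pro ties the hot/cold split to the sweep itself can be illustrated with a toy state machine. This is emphatically not CLOCK-Pro (whose test periods and adaptive target size are more involved); it only shows the core idea that heat is earned by a re-reference observed at the clock hand and is lost the same way, so the hot set cannot grow without bound:

```c
#include <assert.h>
#include <stdbool.h>

enum temp { COLD, HOT };

typedef struct
{
    enum temp t;
    bool      referenced;   /* touched since the hand last passed? */
} Page;

/* One clock-hand visit: a cold page referenced since the last pass has
 * earned promotion; a hot page not referenced since the last pass has
 * cooled and is demoted, becoming an eviction candidate again.  Either
 * way the reference bit is cleared for the next rotation. */
static void clock_visit(Page *p, int *hot_count)
{
    if (p->t == COLD && p->referenced)
    {
        p->t = HOT;
        (*hot_count)++;
    }
    else if (p->t == HOT && !p->referenced)
    {
        p->t = COLD;
        (*hot_count)--;
    }
    p->referenced = false;
}
```

Because both transitions happen only at the hand, classification work is naturally bounded by eviction pressure, which is the property Ants describes.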
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 15, 2014 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote:

On Wed, Apr 16, 2014 at 5:00 AM, Peter Geoghegan p...@heroku.com wrote:

On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma a...@cybertec.at wrote: There's a paper on a non-blocking GCLOCK algorithm that does lock-free clock sweep and buffer pinning [7]. If we decide to stay with GCLOCK it may be interesting, although I still believe that some variant of buffer nailing [8] is a better idea; my experience shows that most of the locking overhead is cache line bouncing, ignoring the extreme cases where our naive spinlock implementation blows up.

You might be right about that, but let's handle one problem at a time. Who knows what the bottleneck will end up being if and when we address the naivety around frequency? I want to better characterize that problem first.

Just to summarize the previous discussion, the improvements that we decided to pursue in this area based on feedback are as follows:

1. Bgwriter needs to be improved so that it can help in reducing usage counts and finding the next victim buffer (run the clock sweep and add buffers to the free list).
2. SetLatch for bgwriter (wake up bgwriter) when elements in the freelist are low.
3. Split the workdone global lock (BufFreelistLock) in StrategyGetBuffer (a spinlock for the freelist, and an lwlock for the clock sweep). Here we can try to make it lock-free based on atomic ops as well.
4. Bgwriter needs to be more aggressive; the logic by which it calculates how many buffers it needs to process needs to be improved.
5. Contention around buffer mapping locks.
6. Cacheline bouncing around the buffer header spinlocks -- is there anything we can do to reduce this?
7. Choose an optimistically-used buffer in StrategyGetBuffer().
8. Don't bump the usage count every time a buffer is pinned.

What about:

9. Don't wait on a locked buffer in the clock sweep.
merlin
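Item 3 in the list above can be sketched roughly as follows. The names loosely echo src/backend/storage/buffer/freelist.c but the structures here are simplified stand-ins, and the locking itself is elided: the point is only that popping the freelist is a two-pointer operation cheap enough to sit under its own spinlock, while the clock-sweep fallback would run under a separate lwlock (or atomics) instead of one global BufFreelistLock covering both:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified buffer descriptor. */
typedef struct BufDesc
{
    int             buf_id;
    struct BufDesc *freeNext;   /* next entry on the freelist, or NULL */
} BufDesc;

typedef struct
{
    BufDesc *firstFreeBuffer;   /* would be guarded by a freelist spinlock */
    int      nextVictimBuffer;  /* would be guarded by a clock-sweep lwlock */
} StrategyCtl;

/* Pop the freelist head; the caller would hold only the freelist
 * spinlock here, never the clock-sweep lock.  Returns NULL when the
 * list is empty, signalling the caller to fall back to the sweep. */
static BufDesc *pop_freelist(StrategyCtl *ctl)
{
    BufDesc *buf = ctl->firstFreeBuffer;

    if (buf != NULL)
        ctl->firstFreeBuffer = buf->freeNext;
    return buf;
}
```

Separating the two paths means a backend that finds a free buffer never contends with backends running the sweep, which is the motivation for the split.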
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 2014-04-16 07:55:44 -0500, Merlin Moncure wrote:

[...] What about:

9. Don't wait on a locked buffer in the clock sweep.

I don't think we do that? Or are you referring to locked buffer headers?

Greetings,
Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 15, 2014 at 11:44 PM, Robert Haas robertmh...@gmail.com wrote:

I think that the basic problem here is that usage counts increase when buffers are referenced, but they decrease when buffers are evicted, and those two things are not in any necessary way connected to each other. [...] I've read a few papers about algorithms that attempt to segregate the list of buffers into hot and cold lists, and an important property of such algorithms is that they mustn't be allowed to make everything hot. [...]

A while back you sketched out an idea that did something like that: hotly accessed buffers became 'perma-pinned' such that they no longer participated in the clock sweep for eviction, and there was a side-line process that did a two-stage eviction (IIRC) from the super-hot stack in order to mitigate locking. This idea had a couple of nice properties: 1) very hot buffers no longer get refcounted, reducing spinlock contention (which has been documented in real-world workloads); 2) the eviction loop shrinks -- although you still have to check the 'very hot' flag, that's an unlocked check (again, IIRC) and no further processing is done. The downside of this approach was complexity and difficult-to-test edge cases. I would like to point out, though, that while I/O efficiency gains are nice, I think contention issues are the bigger fish to fry.
On Mon, Apr 14, 2014 at 12:11 PM, Peter Geoghegan p...@heroku.com wrote:

1) Throttles incrementation of usage_count temporally. It becomes impossible to increment usage_count for any given buffer more frequently than every 3 seconds, while decrementing usage_count is totally unaffected.

hm, that's expensive. how about a heuristic based on the number of buffer allocations and the size of the buffer pool?

On Wed, Apr 16, 2014 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote:

I don't think we do that? Or are you referring to locked buffer headers?

Right -- exactly. I posted a patch for this a while back. It's quite trivial: implement a trylock variant of the buffer header lock macro, and further guard the check with a non-locking test (which TAS() already does generally, but the idea is to avoid locking the cache line in likely cases of contention). I believe this to be unambiguously better: even if it's self-healing or unlikely, there is no good reason to jump into a spinlock fray or even request a contended cache line while holding a critical lock.

merlin
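A sketch of what Merlin describes, using C11 atomics rather than the real TAS()/buffer-header macros (all names here are hypothetical): the relaxed load is the non-locking pre-test that avoids requesting the cache line in exclusive mode when the header is visibly locked, and the compare-and-swap only runs when the lock looked free:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_int locked;      /* 0 = free, 1 = held; stand-in for the
                             * buffer-header spinlock */
    int        usage_count;
} BufferDescHdr;

/* Never spins: if the header is (or merely looks) locked, bail out
 * immediately so the clock sweep can move on to the next buffer. */
static bool buf_hdr_trylock(BufferDescHdr *buf)
{
    /* Non-locking pre-test: a plain read does not pull the cache line
     * in exclusive mode, unlike an unconditional atomic RMW. */
    if (atomic_load_explicit(&buf->locked, memory_order_relaxed) != 0)
        return false;

    int expected = 0;
    return atomic_compare_exchange_strong(&buf->locked, &expected, 1);
}

static void buf_hdr_unlock(BufferDescHdr *buf)
{
    atomic_store(&buf->locked, 0);
}
```

The sweep would simply skip any buffer whose trylock fails, on the theory that a locked header is being touched by someone else anyway and is a poor victim candidate.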
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 2014-04-16 08:25:23 -0500, Merlin Moncure wrote:

The downside of this approach was complexity and difficult-to-test edge cases. I would like to point out, though, that while I/O efficiency gains are nice, I think contention issues are the bigger fish to fry.

That's my feeling as well.

[...] Right -- exactly. I posted a patch for this a while back. It's quite trivial: implement a trylock variant of the buffer header lock macro, and further guard the check with a non-locking test. [...] I believe this to be unambiguously better: even if it's self-healing or unlikely, there is no good reason to jump into a spinlock fray or even request a contended cache line while holding a critical lock.

IIRC you had problems proving the benefits of that, right? I think that's because the locking times of buffer headers are short enough that it's really unlikely to read a locked buffer header spinlock. The spinlock acquisition will have made the locker the exclusive owner of the cache line in the majority of cases, and as soon as that happens the cache miss/transfer will take far longer than the lock is held. I think this is the wrong level to optimize things at. Imo there are two possible solutions (that don't exclude each other):

* Perform the clock sweep in one process, so there's a very fast way to get to a free buffer. Possibly in a partitioned way.
* Don't take a global exclusive lock while performing the clock sweep. Instead, increase StrategyControl->nextVictimBuffer in chunks under an exclusive lock, and then scan the potential victim buffers in those chunks without a global lock held.
Greetings,
Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
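Andres's second suggestion can be sketched like this (hypothetical names and sizes; the real hand is StrategyControl->nextVictimBuffer): the shared clock hand is advanced a whole chunk at a time with a single atomic operation, and the caller then inspects its claimed range of buffers with no global lock held:

```c
#include <assert.h>
#include <stdatomic.h>

#define NBUFFERS    1024    /* assumed pool size; a power of two, so the
                             * modulo stays correct across unsigned
                             * counter wraparound */
#define SWEEP_CHUNK 16      /* buffers claimed per atomic advance */

static atomic_uint sweep_hand;   /* stand-in for nextVictimBuffer */

/* Claim the next chunk of the clock: one fetch_add replaces taking a
 * global lock for every single buffer visited.  The caller privately
 * scans buffers [start, start + SWEEP_CHUNK) modulo NBUFFERS, checking
 * usage counts and buffer headers without any global lock held. */
static unsigned claim_sweep_chunk(void)
{
    return atomic_fetch_add(&sweep_hand, SWEEP_CHUNK) % NBUFFERS;
}
```

Contention on the hand then scales with NBUFFERS / SWEEP_CHUNK rather than with the number of buffers visited, at the cost of slightly coarser victim ordering between concurrent sweepers.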