On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> If I'm reading this right, the new statistic is an incrementing counter
>> where, every time you update it, you add the number of buffers currently on
>> the freelist. That makes no sense.
>
> I think using 'number of buffers currently on the freelist' and
> 'number of recently allocated buffers' for consecutive cycles,
> we can figure out approximately how many buffer allocations
> need clock sweep assuming low and high threshold water
> marks are fixed. However there can be cases where it is not
> easy to estimate that number.

Counters should be designed in such a way that you can read one, read it
again later, and make sense of it - you should not need to read the
counter on *consecutive* cycles to interpret it.

>> I think what you should be counting is the number of allocations that are
>> being satisfied from the free-list. Then, by comparing the rate at which
>> that value is incrementing to the rate at which buffers_alloc is
>> incrementing, somebody can figure out what percentage of allocations are
>> requiring a clock-sweep run. Actually, I think it's better to flip it
>> around: count the number of allocations that require an individual backend
>> to run the clock sweep (vs. being satisfied from the free-list); call it,
>> say, buffers_backend_clocksweep. We can then try to tune the patch to make
>> that number as small as possible under varying workloads.
>
> This can give us clear idea to tune the patch, however we need
> to maintain 3 counters for it in code(recent_alloc (needed for
> current bgwriter logic) and other 2 suggested by you). Do you
> want to retain such counters in code or it's for kind of debug info
> for patch?

I only mean to propose one new counter, and I'd imagine including that
in the final patch. We already have a counter of total buffer
allocations; that's buffers_alloc. I'm proposing to add an additional
counter for the number of those allocations not satisfied from the free
list, with a name like buffers_alloc_clocksweep (I said
buffers_backend_clocksweep above, but that's probably not best, as the
existing buffers_backend counts buffer *writes*, not allocations). I
think we would definitely want to retain this counter in the final
patch, as an additional column in pg_stat_bgwriter.

>>> d. Autotune the low and high threshold for freelist for various
>>> configurations.
>>
>> I think we need to come up with some kind of formula here rather than just
>> a list of hard-coded constants.
>
> That was my initial intention as well and I have tried based
> on number of shared buffers like keeping threshold values as
> percentage of shared buffers but nothing could satisfy different
> kind of workloads. The current values I have chosen are based
> on experiments for various workloads at different thresholds. I have
> shown the lwlock_stats data for various loads based on current
> thresholds upthread. Another way could be to make them as config
> knobs and use the values as given by user in case it is provided by
> user else go with fixed values.

How did you go about determining the optimal value for a particular
workload?

When the list is kept short, it's less likely that a buffer on the list
will be referenced or dirtied again before the page is actually
recycled. That's clearly good. But when the list is long, it's less
likely to become completely empty and thereby force individual backends
to run the clock-sweep.

My suspicion is that, when the number of buffers is small, the impact of
the list being too short isn't likely to be very significant, because
running the clock-sweep isn't all that expensive anyway - even if you
have to scan through the entire buffer pool multiple times, there aren't
that many buffers. But when the number of buffers is large, those
repeated scans can cause a major performance hit, so having an adequate
pool of free buffers becomes much more important.

I think your list of high-watermarks is far too generous for low buffer
counts. With more than 100k shared buffers, you've got a high-watermark
of 2k buffers, which means that 2% or less of the buffers will be on the
freelist, which seems a little on the high side to me, but probably in
the ballpark of what we should be aiming for. But at 10001 shared
buffers, you can have 1000 of them on the freelist, which is 10% of the
buffer pool; that seems high. At 101 shared buffers, 75% of the buffers
in the system can be on the freelist; that seems ridiculous.
The chances of a buffer still being unused by the time it reaches the
head of the freelist seem very small.

Based on your existing list of thresholds, and taking the above into
account, I'd suggest something like this: let the high-watermark for the
freelist be 0.5% of the total number of buffers, with a maximum of 2000
and a minimum of 5. Let the low-watermark be 20% of the high-watermark.
That might not be best, but I think some kind of formula like that can
likely be made to work. I would suggest focusing your testing on
configurations with *large* settings for shared_buffers, say 1-64GB,
rather than small configurations. Anyone who cares greatly about
performance isn't going to be running with only 8MB of shared_buffers
anyway. Arguably we shouldn't even run the reclaim process on very small
configurations; I think there should probably be a GUC (PGC_SIGHUP) to
control whether it gets launched.

I think it would be a good idea to analyze how frequently the reclaim
process gets woken up. In the worst case, this happens once per (high
watermark - low watermark) allocations; that is, the system reaches the
low watermark and then does no further allocations until the reclaim
process brings the freelist back up to the high watermark. But if more
allocations occur between the time the reclaim process is woken and the
time it reaches the high watermark, then it should run for longer, until
the high watermark is reached. At least for debugging purposes, I think
it would be useful to have a counter of reclaim wakeups. I'm not sure
whether that's worth including in the final patch, but it might be.

> That will certainly help in retaining the current behaviour of
> bgwriter and make the idea cleaner. I will modify the patch
> to have a new background process unless somebody thinks
> otherwise.
>
> If we go with this approach, one thing which we need to decide
> is what to do in case buf which has usage_count as zero is *dirty*,
> as I don't think it is good idea to put it in freelist.

I thought a bit about this yesterday. I think the problem is that we
might be in a situation where buffers are being dirtied faster than they
can be cleaned. In that case, if we only put clean buffers on the
freelist, then every backend in the system will be fighting over the
ever-dwindling supply of clean buffers until, in the worst case, there's
maybe only 1 clean buffer which is getting evicted repeatedly at top
speed - or maybe even no clean buffers, and the reclaim process just
spins in an infinite loop looking for clean buffers that aren't there.

To put that another way, the rate at which buffers are being dirtied
can't exceed the rate at which they are being cleaned forever.
Eventually, somebody is going to have to wait. Having the backends wait
by being forced to write some dirty buffers does not seem like a bad way
to accomplish that. So I favor just putting the buffers on the freelist
without regard to whether they are clean or dirty. If this turns out not
to work well we can look at other options (probably some variant of (b)
from your list).

>> Instead, it would just run the clock sweep (i.e. the last loop inside
>> StrategyGetBuffer) and put the buffers onto the free list.
>
> Don't we need to do more than just last loop inside StrategyGetBuffer(),
> as clock sweep in strategy get buffer is responsible for getting one
> buffer with usage_count = 0 where as we need to run the loop till it
> finds and moves enough such buffers so that it can populate freelist
> with number of buffers equal to high water mark of freelist.

Yeah, that's what I meant. Of course, it should add each buffer to the
freelist individually, not batch them up and add them all at once.
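
To make the watermark formula and the reclaim loop discussed above a bit
more concrete, here is a rough sketch in Python. It is a toy model only,
not PostgreSQL code: the function names (freelist_watermarks, reclaim),
the integer rounding, and the list/deque representation of the buffer
pool are all assumptions made for illustration, and pinned buffers and
locking are ignored entirely.

```python
from collections import deque


def freelist_watermarks(nbuffers):
    """Return (low, high) freelist watermarks for a pool of nbuffers.

    Follows the formula suggested in the text: high is 0.5% of the
    buffers, clamped to the range [5, 2000]; low is 20% of high.
    Truncating integer division is an assumption of this sketch.
    """
    high = max(5, min(2000, nbuffers * 5 // 1000))
    low = high * 20 // 100
    return low, high


def reclaim(usage_counts, freelist, hand, high):
    """Toy clock sweep that refills the freelist up to the high watermark.

    usage_counts: list of per-buffer usage counts (mutated in place)
    freelist:     deque of buffer ids already on the freelist
    hand:         index where the clock sweep resumes
    Returns the new position of the clock hand.
    """
    nbuffers = len(usage_counts)
    target = min(high, nbuffers)  # cannot free more buffers than exist
    on_list = set(freelist)
    while len(freelist) < target:
        buf = hand
        hand = (hand + 1) % nbuffers
        if buf in on_list:
            continue  # already on the freelist, skip it
        if usage_counts[buf] > 0:
            usage_counts[buf] -= 1  # give the buffer another chance
        else:
            # Each buffer is added individually, not batched, and
            # without regard to whether it is clean or dirty.
            freelist.append(buf)
            on_list.add(buf)
    return hand
```

For example, freelist_watermarks(10001) gives a low watermark of 10 and
a high watermark of 50, so under this sketch the reclaim process would,
in the worst case, be woken once per 40 allocations.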

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company