On 2013-06-27 09:50:32 -0400, Robert Haas wrote:
> On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > Contention-wise I agree. What I have seen is that we have a huge
> > amount of cacheline bouncing around the buffer header spinlocks.
>
> How did you measure that?
perf record -e cache-misses. If you want it more detailed, looking at {L1,LLC}-{load,store}{s,misses} can sometimes be helpful too. Also, running perf stat -vvv postgres -D ... for a whole benchmark can be useful to compare how much a change influences cache misses and such. For very detailed analysis, running something under valgrind/cachegrind can be helpful too, but I usually find perf to be sufficient.

> > I have previously added some ad hoc instrumentation that printed the
> > amount of buffers that were required (by other backends) during a
> > bgwriter cycle and the amount of buffers that the buffer manager
> > could actually write out.
>
> I think you can see how many are needed from buffers_alloc. No?

Not easily correlated with bgwriter activity. If we cannot keep up because we're 100% busy writing out buffers, I don't have many problems with that. But I don't think we often are.

> > Problems with the current code:
> >
> > * doesn't manipulate the usage_count and never does anything to used
> >   pages. Which means it will just about never find a victim buffer
> >   in a busy database.
>
> Right. I was thinking that was part of this patch, but it isn't. I
> think we should definitely add that. In other words, the background
> writer's job should be to run the clock sweep and add buffers to the
> free list.

We might need to split it into two processes for that: one to write out dirty pages, one to populate the freelist. Otherwise we will probably keep hitting the current scalability issues whenever we're IO contended, say during a busy or even an immediate checkpoint.

> I think we should also split the lock: a spinlock for the
> freelist, and an lwlock for the clock sweep.

Yea, I thought about that when writing the part about the exclusive lock held during the clock sweep.

> > * by far not aggressive enough, touches only a few buffers ahead of
> >   the clock sweep.
>
> Check. Fixing this might be a separate patch, but then again maybe
> not. The changes we're talking about here provide a natural feedback
> mechanism: if we observe that the freelist is empty (or less than
> some length, like 32 buffers?) set the background writer's latch,
> because we know it's not keeping up.

Yes, that makes sense. It also adapts to bursty workloads, which means we don't need overly complex logic for that in the bgwriter.

> > There's another thing we could do to noticeably improve scalability
> > of buffer acquisition. Currently we do a huge amount of work under
> > the freelist lock.
> > ...
> > So, we perform the entire clock sweep, until we have found a single
> > buffer we can use, inside a *global* lock. At times we need to
> > iterate over the whole of shared buffers BM_MAX_USAGE_COUNT (5)
> > times until we have pushed down all the usage counts enough (if the
> > database is busy it can take even longer...).
> > In a busy database, where usually all the usage counts are high,
> > the next backend will touch a lot of those buffers again, which
> > causes massive cache eviction & bouncing.
> >
> > It seems far more sensible to only protect the clock sweep's
> > nextVictimBuffer with a spinlock. With some care all the rest can
> > happen without any global interlock.
>
> That's a lot more spinlock acquire/release cycles, but it might work
> out to a win anyway. Or it might lead to the system suffering a
> horrible spinlock-induced death spiral on eviction-heavy workloads.

I can't imagine it being worse than what we have today.

Also, nobody requires us to advance the clock sweep by only one page at a time; we could easily do, say, 29 pages at once if we detect that the lock is contended. Alternatively, it shouldn't be too hard to turn the hand into an atomic increment, although that requires some trickery to handle the wraparound sanely.
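To make that a bit more concrete, here is a rough sketch of how I imagine the pieces fitting together: the freelist under its own spinlock, a watermark that wakes the bgwriter when the list runs short, and a second spinlock that covers nothing but nextVictimBuffer, with usage_count only ever touched under the per-buffer header lock. Completely untested, and the names (MyStrategyControl, MyGetVictimBuffer) and the watermark value are made up:

	/*
	 * Rough sketch, uncompiled: buffer allocation with the global
	 * freelist lock split up.  The freelist has its own spinlock,
	 * the clock sweep hand another, and usage_count is only ever
	 * touched under the per-buffer header lock.
	 */
	typedef struct
	{
		slock_t		freelist_lock;	/* protects the two fields below */
		int			firstFreeBuffer;
		int			numFreeBuffers;

		slock_t		victim_lock;	/* protects nextVictimBuffer, nothing else */
		int			nextVictimBuffer;

		Latch	   *bgwriterLatch;	/* so backends can wake the bgwriter */
	} MyStrategyControl;

	#define FREELIST_WAKEUP_THRESHOLD 32	/* made-up watermark */

	static volatile BufferDesc *
	MyGetVictimBuffer(MyStrategyControl *sc)
	{
		for (;;)
		{
			volatile BufferDesc *buf = NULL;
			int			victim;

			/* Fast path: pop a buffer off the freelist. */
			SpinLockAcquire(&sc->freelist_lock);
			if (sc->firstFreeBuffer >= 0)
			{
				buf = &BufferDescriptors[sc->firstFreeBuffer];
				sc->firstFreeBuffer = buf->freeNext;
				sc->numFreeBuffers--;
			}
			SpinLockRelease(&sc->freelist_lock);

			/*
			 * Freelist running short?  Wake the bgwriter so it refills
			 * it.  An unlocked read of numFreeBuffers is fine for a
			 * heuristic.
			 */
			if (sc->numFreeBuffers < FREELIST_WAKEUP_THRESHOLD &&
				sc->bgwriterLatch != NULL)
				SetLatch(sc->bgwriterLatch);

			if (buf == NULL)
			{
				/* Clock sweep: only advancing the hand is globally locked. */
				SpinLockAcquire(&sc->victim_lock);
				victim = sc->nextVictimBuffer;
				if (++sc->nextVictimBuffer >= NBuffers)
					sc->nextVictimBuffer = 0;
				SpinLockRelease(&sc->victim_lock);

				buf = &BufferDescriptors[victim];
			}

			/*
			 * Check and decrement usage_count under the buffer header
			 * lock only; no global lock is held at this point.
			 */
			LockBufHdr(buf);
			if (buf->refcount == 0 && buf->usage_count == 0)
				return buf;		/* returned with header lock held */
			if (buf->usage_count > 0)
				buf->usage_count--;
			UnlockBufHdr(buf);
		}
	}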
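The atomic variant could then look roughly like this, assuming we grow a fetch-and-add primitive (gcc's __sync_fetch_and_add stands in for that here). The counter runs freely and is only mapped into [0, NBuffers) at use, which also gives us the multi-page advance for free; the jump at UINT32_MAX is exactly the wraparound trickery that would still need solving, e.g. with a compare-and-swap loop:

	/*
	 * Sketch: lockless clock hand.  The shared counter just keeps
	 * incrementing; consumers map it into the buffer id range.
	 */
	static volatile uint32 nextVictim = 0;	/* really in shared memory */

	static int
	ClockSweepTick(int nticks)
	{
		uint32		ticket;

		/*
		 * Reserve nticks positions (say 29 when contended, 1
		 * otherwise) with a single atomic op; the caller owns
		 * buffer ids ticket % NBuffers ... (ticket + nticks - 1)
		 * % NBuffers.
		 */
		ticket = __sync_fetch_and_add(&nextVictim, nticks);

		/*
		 * XXX wraparound: when the counter passes UINT32_MAX the
		 * modulo jumps, unless NBuffers happens to divide 2^32
		 * evenly; a real version needs e.g. a compare-and-swap
		 * loop to fold the counter back into range at that point.
		 */
		return (int) (ticket % NBuffers);
	}

The per-buffer usage_count handling would stay as in the previous sketch; only the advancement of the hand changes.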
Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services