On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <and...@anarazel.de> wrote: > As mentioned nearby [1], Thomas brought up [2] the idea of using > ReadRecentBuffer() _bt_getroot(). I couldn't resist and prototyped it.
Thanks! > Unfortunately it scaled way worse at first. This is not an inherent issue, but > due to an implementation choice in ReadRecentBuffer(). Whereas the normal > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does > LockBufHdr(), checks if the buffer ID is the same and then uses > PinBuffer_Locked(). > > The problem with that is that PinBuffer() takes care to not hold the buffer > header spinlock, it uses compare_exchange to atomically acquire the pin, while > guaranteing nobody holds the lock. When holding the buffer header spinlock, > there obviously is the risk of being scheduled out (or even just not have > exclusive access to the cacheline). Yeah. Aside from inherent nastiness of user-space spinlocks, this new use case is also enormously more likely to contend and then get into trouble by being preempted due to btree root pages being about the hottest pages in the universe than the use case I was focusing on at the time. > The fairly obvious solution to this is to just use PinBuffer() and just unpin > the buffer if its identity was changed concurrently. There could be an > unlocked pre-check as well. However, there's the following comment in > ReadRecentBuffer(): > * It's now safe to pin the buffer. We can't pin > first and ask > * questions later, because it might confuse code > paths like > * InvalidateBuffer() if we pinned a random > non-matching buffer. > */ > > But I'm not sure I buy that - there's plenty other things that can briefly > acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other > contents, etc). I may well have been too cautious with that. The worst thing I can think of right now is that InvalidateBuffer() would busy loop (as it already does in other rare cases) when it sees a pin. > Another difference between using PinBuffer() and PinBuffer_locked() is that > the latter does not adjust a buffer's usagecount. > > Leaving the scalability issue aside, isn't it somewhat odd that optimizing a > codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not > increasing usagecount anymore? Yeah, that is not great. The simplification you suggest would fix that too, though I guess it would also bump the usage count of buffers that don't have the tag we expected; that's obviously rare and erring on a better side though. > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching > the root page's buffer id in RelationData, seems a noticeable win. About 7% in > a concurrent, read-only pgbench that utilizes batches of 10. And it should be > easy to get much bigger wins, e.g. with a index nested loop with a relatively > small index on the inner side. Wooo, that's better than I was hoping. Thanks for trying it out! I think, for the complexity involved (ie very little), it's a nice result, and worth considering even though it's also a solid clue that we could do much better than this with a (yet to be designed) longer-lived pin scheme. smgr_targblock could be another easy-to-cache candidate, ie a place where there is a single interesting hot page that we're already keeping track of with no requirement for new backend-local mapping machinery.