Hi, As mentioned nearby [1], Thomas brought up [2] the idea of using ReadRecentBuffer() _bt_getroot(). I couldn't resist and prototyped it.
Unfortunately it scaled way worse at first. This is not an inherent issue, but due to an implementation choice in ReadRecentBuffer(). Whereas the normal BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does LockBufHdr(), checks if the buffer ID is the same and then uses PinBuffer_Locked(). The problem with that is that PinBuffer() takes care to not hold the buffer header spinlock, it uses compare_exchange to atomically acquire the pin, while guaranteing nobody holds the lock. When holding the buffer header spinlock, there obviously is the risk of being scheduled out (or even just not have exclusive access to the cacheline). ReadRecentBuffer() scales worse even if LockBufHdr() is immediately followed by PinBuffer_Locked(), so it's really just holding the lock that is the issue. The fairly obvious solution to this is to just use PinBuffer() and just unpin the buffer if its identity was changed concurrently. There could be an unlocked pre-check as well. However, there's the following comment in ReadRecentBuffer(): * It's now safe to pin the buffer. We can't pin first and ask * questions later, because it might confuse code paths like * InvalidateBuffer() if we pinned a random non-matching buffer. */ But I'm not sure I buy that - there's plenty other things that can briefly acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other contents, etc). Another difference between using PinBuffer() and PinBuffer_locked() is that the latter does not adjust a buffer's usagecount. Leaving the scalability issue aside, isn't it somewhat odd that optimizing a codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not increasing usagecount anymore? FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching the root page's buffer id in RelationData, seems a noticeable win. About 7% in a concurrent, read-only pgbench that utilizes batches of 10. And it should be easy to get much bigger wins, e.g. with a index nested loop with a relatively small index on the inner side. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20230627013458.axge7iylw7llyvww%40awork3.anarazel.de [2] https://twitter.com/MengTangmu/status/1673439083518115840