On 7/18/25 13:03, Andres Freund wrote: > Hi, Hello. Thanks again for taking the time to review the email and patch; I think we're onto something good here.
> > I'd be curious if anybody wants to argue for keeping the clock sweep. Except > for the have_free_buffer() use in autoprewarm, it's a rather trivial > patch. And I really couldn't measure regressions above the noise level, even > if absurdly extreme use cases. Hmmm... was "argue for keeping the clock sweep" supposed to read "argue for keeping the freelist"? > On 2025-07-17 14:35:13 -0400, Greg Burd wrote: >> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote: >>> I think we'll likely need something to replace it. >> Fair, this (v5) patch doesn't yet try to address this. >> >>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite >>> right. The goal of the use have_free_buffer() is obviously to stop >>> prewarming >>> shared buffers if doing so would just evict buffers. But it's not clear >>> to me >>> that we should just stop when there aren't any free buffers - what if the >>> previous buffer contents aren't the right ones? It'd make more sense to >>> me to >>> stop autoprewarm once NBuffers have been prewarmed... >> I had the same high level reaction, that autoprewarm was leveraging >> something >> convenient but not necessarily required or even correct. I'd considered >> using >> NBuffers as you describe due to similar intuitions, I'll dig into that idea >> for >> the next revision after I get to know autoprewarm a bit better. > Cool. I do think that'll be good enough. I re-added the have_free_buffer() function, only now it returns false once nextVictimBuffer >= NBuffers, signaling to autoprewarm that the clock has made its first complete pass. With that, I reverted my changes in the autoprewarm module. The net result should be the same behavior as before at startup when using that module. >>> The most obvious way around this would be to make the clock hand a 64bit >>> atomic, which would avoid the need to have a separate tracker for the >>> number >>> of passes. 
Unfortunately doing so would require doing a modulo >>> operation each >>> clock tick, which I think would likely be too expensive on common >>> platforms - >>> on small shared_buffers I actually see existing, relatively rarely >>> reached, >>> modulo in ClockSweepTick() show up on a Skylake-X system. >> So, this idea came back to me today as I tossed out the union branch and >> started >> over. >> >> a) can't require a power of 2 for NBuffers >> b) would like a power of 2 for NBuffers to make a few things more efficient >> c) a simple uint64 atomic counter would simplify things >> >> The attached (v5) patch takes this approach *and* avoids the modulo you were >> concerned with. My approach is to have nextVictimBuffer as a uint64 that >> only >> increments (and at some point 200 years or so might wrap around, but I >> digress). >> To get the actual "victim" you modulo that, but not with "%" you call >> clock_modulo(). In that function I use a "next power of 2" value rather >> than >> NBuffers to efficiently find the modulo and adjust for the actual value. >> Same >> for completePasses which is now a function clock_passes() that does similar >> trickery and returns the number of times the counter (nextVictimBuffer) has >> "wrapped" around modulo NBuffers. > Yea, that could work! It'd be interesting to see some performance numbers for > this... Still no performance comparisons yet, but my gut says this should reduce contention across cores on a very hot path so I'd imagine some performance improvement. >> Now that both values exist in the same uint64 it can be the atomic vessel >> that coordinates them, no synchronization problems at all and no requirement >> for the buffer_strategy_lock. > Nice! > > >>> I think while at it, we should make ClockSweepTick() decrement >>> nextVictimBuffer by atomically subtracting NBuffers, rather than using >>> CAS. 
I >>> recently noticed that the CAS sometimes has to retry a fair number of >>> times, >>> which in turn makes the `victim % NBuffers` show up in profiles. >> In my (v5) patch there is one CAS that increments NBuffers. All other >> operations on NBuffers are atomic reads. The modulo you mention is gone >> entirely, unnecessary AFAICT. > There shouldn't be any CASes needed now, right? Just a fetch-add? The latter > often scales *way* better under contention. > > [Looks at the patch ...] > > Which I think is true in your patch, I don't see any CAS. You are correct, no CAS at all anymore just a mental mistake in the last email. Now there are only atomic reads and single atomic fetch-add in ClockSweepTick(). >> Meanwhile, the tests except for Windows pass [2] for this new patch [3]. >> I'll dig into the Windows issues next week as well. > FWIW, there are backtraces generated on windows. E.g. > > https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt > > 000000cd`827fdea0 00007ff7`6ad82f88 ucrtbased!abort(void)+0x5a > [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77] > 000000cd`827fdee0 00007ff7`6aae2b7c postgres!ExceptionalCondition( > char * conditionName = 0x00007ff7`6b2a4cb8 "result < > NBuffers", > char * fileName = 0x00007ff7`6b2a4c88 > "../src/backend/storage/buffer/freelist.c", > int lineNumber = 0n139)+0x78 > [c:\cirrus\src\backend\utils\error\assert.c @ 67] > 000000cd`827fdf20 00007ff7`6aae272c postgres!clock_modulo( > unsigned int64 counter = 0x101)+0x6c > [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139] > 000000cd`827fdf60 00007ff7`6aad8647 postgres!StrategySyncStart( > unsigned int * complete_passes = 0x000000cd`827fdfc0, > unsigned int * num_buf_alloc = > 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ > 300] > 000000cd`827fdfa0 00007ff7`6aa254a3 postgres!BgBufferSync( > struct WritebackContext * wb_context = > 0x000000cd`827fe180)+0x37 
[c:\cirrus\src\backend\storage\buffer\bufmgr.c @ > 3649] > 000000cd`827fe030 00007ff7`6aa278a7 postgres!BackgroundWriterMain( > void * startup_data = 0x00000000`00000000, > unsigned int64 startup_data_len = 0)+0x243 > [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236] > 000000cd`827ff5a0 00007ff7`6a8daf19 postgres!SubPostmasterMain( > int argc = 0n3, > char ** argv = 0x0000028f`e75d24d0)+0x2f7 > [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714] > 000000cd`827ff620 00007ff7`6af0f5a9 postgres!main( > int argc = 0n3, > char ** argv = 0x0000028f`e75d24d0)+0x329 > [c:\cirrus\src\backend\main\main.c @ 222] > > I.e. your new assertion failed for some reason that i can't *immediately* see. I put that in as a precaution and as a way to communicate the intention of the other code above it. I never imagined it would assert. I've changed clock_read() to only assert when the modulo differs and left that assert in the calling ClockSweepTick() function because it was redundant and I'm curious to see if we see a similar assert when testing the modulo. >> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context) >> >> /* >> * Compute strategy_delta = how many buffers have been scanned by the >> - * clock sweep since last time. If first time through, assume none. >> Then >> - * see if we are still ahead of the clock sweep, and if so, how many >> + * clock-sweep since last time. If first time through, assume none. >> Then >> + * see if we are still ahead of the clock-sweep, and if so, how many >> * buffers we could scan before we'd catch up with it and "lap" it. >> Note: >> * weird-looking coding of xxx_passes comparisons are to avoid bogus >> * behavior when the passes counts wrap around. 
>> */ >> if (saved_info_valid) >> { >> - int32 passes_delta = strategy_passes - >> prev_strategy_passes; >> + int32 passes_delta; >> + >> + if (unlikely(prev_strategy_passes > strategy_passes)) >> + { >> + /* wrap-around case */ >> + passes_delta = (int32) (UINT32_MAX - >> prev_strategy_passes + strategy_passes); >> + } >> + else >> + { >> + passes_delta = (int32) (strategy_passes - >> prev_strategy_passes); >> + } >> >> strategy_delta = strategy_buf_id - prev_strategy_buf_id; >> strategy_delta += (long) passes_delta * NBuffers; > That seems somewhat independent of the rest of the change, or am I missing > something? That change is there to cover the possibility of someone managing to overflow and wrap a uint64 which is *highly* unlikely. If this degree of paranoia isn't required I'm happy to remove it. >> +static uint32 NBuffersPow2; /* NBuffers rounded up to the next >> power of 2 */ >> +static uint32 NBuffersPow2Shift; /* Amount to bitshift NBuffers for >> + * >> division */ > For performance in ClockSweepTick() it might more sense to store the mask > (i.e. NBuffersPow2 - 1), rather than the actual power of two. Agreed, I've done that and created one more calculated value that could be pre-computed once and never again (unless NBuffers changes) at runtime. > Greetings, > > Andres Freund thanks again for the review, v6 attached and re-based onto afa5c365ec5, also on GitHub at [1][2]. -greg [1] https://github.com/gburd/postgres/pull/7/checks [2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v6
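For reference, the semantics discussed in this thread (a single 64-bit counter that only increments, with both the victim index and the pass count derived from it) can be sketched in standalone C11. This is a sketch under stated assumptions, not the patch's code: ClockSketch, clock_sketch_init(), clock_sketch_tick(), clock_sketch_passes(), and passes_delta() are hypothetical names, C11 <stdatomic.h> stands in for PostgreSQL's pg_atomic_* API, and plain % and / are used where the patch substitutes its power-of-2 helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared strategy state (not the patch's struct). */
typedef struct
{
	_Atomic uint64_t hand;		/* clock hand; only ever incremented */
	uint32_t	nbuffers;		/* pool size; need not be a power of 2 */
} ClockSketch;

static void
clock_sketch_init(ClockSketch *c, uint32_t nbuffers)
{
	atomic_init(&c->hand, 0);
	c->nbuffers = nbuffers;
}

/*
 * One tick: a single fetch-add, no CAS retry loop and no lock.
 * Returns the index of the buffer to examine.
 */
static uint32_t
clock_sketch_tick(ClockSketch *c)
{
	uint64_t	old = atomic_fetch_add(&c->hand, 1);

	return (uint32_t) (old % c->nbuffers);
}

/* Completed passes, derived from the same counter (truncated to 32 bits). */
static uint32_t
clock_sketch_passes(ClockSketch *c)
{
	uint64_t	hand = atomic_load(&c->hand);

	return (uint32_t) (hand / c->nbuffers);
}

/*
 * Difference between two 32-bit pass counts.  Unsigned subtraction is
 * performed modulo 2^32, so a counter that wrapped between the two
 * readings still yields the right distance without a branch.
 */
static int32_t
passes_delta(uint32_t prev, uint32_t cur)
{
	return (int32_t) (uint32_t) (cur - prev);
}
```

With a pool of 5 buffers, six ticks return 0, 1, 2, 3, 4, 0 and the pass count then reads 1. passes_delta() treats a counter that went from UINT32_MAX to 1 as having advanced by 2.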
From 4b747751d9c2fb679496f8c0c0d4dd4373a14b48 Mon Sep 17 00:00:00 2001 From: Greg Burd <g...@burd.me> Date: Thu, 10 Jul 2025 14:45:32 -0400 Subject: [PATCH v6 1/2] Eliminate the freelist from the buffer manager and depend on clock-sweep. This set of changes removes the list of available buffers and instead simply uses the clock-sweep algorithm to find and return an available buffer. While on the surface this appears to be removing an optimization, it is in fact eliminating code that induces overhead in the form of synchronization that is problematic for multi-core systems. This also changes the have_free_buffer() function to return true until every buffer in the pool has been considered once by the clock-sweep algorithm, so as to inform the pg_prewarm module as to when to stop warming. --- src/backend/storage/buffer/README | 42 +++------ src/backend/storage/buffer/buf_init.c | 9 -- src/backend/storage/buffer/bufmgr.c | 29 +------ src/backend/storage/buffer/freelist.c | 120 +++----------------------- src/include/storage/buf_internals.h | 11 +-- 5 files changed, 28 insertions(+), 183 deletions(-) diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README index a182fcd660c..cd52effd911 100644 --- a/src/backend/storage/buffer/README +++ b/src/backend/storage/buffer/README @@ -128,11 +128,11 @@ independently. If it is necessary to lock more than one partition at a time, they must be locked in partition-number order to avoid risk of deadlock. * A separate system-wide spinlock, buffer_strategy_lock, provides mutual -exclusion for operations that access the buffer free list or select -buffers for replacement. A spinlock is used here rather than a lightweight -lock for efficiency; no other locks of any sort should be acquired while -buffer_strategy_lock is held. This is essential to allow buffer replacement -to happen in multiple backends with reasonable concurrency. +exclusion for operations that select buffers for replacement. 
A spinlock is +used here rather than a lightweight lock for efficiency; no other locks of any +sort should be acquired while buffer_strategy_lock is held. This is essential +to allow buffer replacement to happen in multiple backends with reasonable +concurrency. * Each buffer header contains a spinlock that must be taken when examining or changing fields of that buffer header. This allows operations such as @@ -158,18 +158,9 @@ unset by sleeping on the buffer's condition variable. Normal Buffer Replacement Strategy ---------------------------------- -There is a "free list" of buffers that are prime candidates for replacement. -In particular, buffers that are completely free (contain no valid page) are -always in this list. We could also throw buffers into this list if we -consider their pages unlikely to be needed soon; however, the current -algorithm never does that. The list is singly-linked using fields in the -buffer headers; we maintain head and tail pointers in global variables. -(Note: although the list links are in the buffer headers, they are -considered to be protected by the buffer_strategy_lock, not the buffer-header -spinlocks.) To choose a victim buffer to recycle when there are no free -buffers available, we use a simple clock-sweep algorithm, which avoids the -need to take system-wide locks during common operations. It works like -this: +To choose a victim buffer to recycle when there are no free buffers available, +we use a simple clock-sweep algorithm, which avoids the need to take +system-wide locks during common operations. It works like this: Each buffer header contains a usage counter, which is incremented (up to a small limit value) whenever the buffer is pinned. (This requires only the @@ -184,20 +175,14 @@ The algorithm for a process that needs to obtain a victim buffer is: 1. Obtain buffer_strategy_lock. -2. If buffer free list is nonempty, remove its head buffer. Release -buffer_strategy_lock. 
If the buffer is pinned or has a nonzero usage count, -it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer, -and return it. +2. Select the buffer pointed to by nextVictimBuffer, and circularly advance +nextVictimBuffer for next time. Release buffer_strategy_lock. -3. Otherwise, the buffer free list is empty. Select the buffer pointed to by -nextVictimBuffer, and circularly advance nextVictimBuffer for next time. -Release buffer_strategy_lock. - -4. If the selected buffer is pinned or has a nonzero usage count, it cannot +3. If the selected buffer is pinned or has a nonzero usage count, it cannot be used. Decrement its usage count (if nonzero), reacquire buffer_strategy_lock, and return to step 3 to examine the next buffer. -5. Pin the selected buffer, and return. +4. Pin the selected buffer, and return. (Note that if the selected buffer is dirty, we will have to write it out before we can recycle it; if someone else pins the buffer meanwhile we will @@ -234,7 +219,7 @@ the ring strategy effectively degrades to the normal strategy. VACUUM uses a ring like sequential scans, however, the size of this ring is controlled by the vacuum_buffer_usage_limit GUC. Dirty pages are not removed -from the ring. Instead, WAL is flushed if needed to allow reuse of the +from the ring. Instead, the WAL is flushed if needed to allow reuse of the buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's buffers were sent to the freelist, which was effectively a buffer ring of 1 buffer, resulting in excessive WAL flushing. @@ -277,3 +262,4 @@ As of 8.4, background writer starts during recovery mode when there is some form of potentially extended recovery to perform. It performs an identical service to normal processing, except that checkpoints it writes are technically restartpoints. 
+ diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index ed1dc488a42..6fd3a6bbac5 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -128,20 +128,11 @@ BufferManagerShmemInit(void) pgaio_wref_clear(&buf->io_wref); - /* - * Initially link all the buffers together as unused. Subsequent - * management of this list is done by freelist.c. - */ - buf->freeNext = i + 1; - LWLockInitialize(BufferDescriptorGetContentLock(buf), LWTRANCHE_BUFFER_CONTENT); ConditionVariableInit(BufferDescriptorGetIOCV(buf)); } - - /* Correct last entry of linked list */ - GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST; } /* Init other shared buffer-management stuff */ diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6afdd28dba6..af5ef025229 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -2099,12 +2099,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, */ UnpinBuffer(victim_buf_hdr); - /* - * The victim buffer we acquired previously is clean and unused, let - * it be found again quickly - */ - StrategyFreeBuffer(victim_buf_hdr); - /* remaining code should match code at top of routine */ existing_buf_hdr = GetBufferDescriptor(existing_buf_id); @@ -2163,8 +2157,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, } /* - * InvalidateBuffer -- mark a shared buffer invalid and return it to the - * freelist. + * InvalidateBuffer -- mark a shared buffer invalid. * * The buffer header spinlock must be held at entry. We drop it before * returning. (This is sane because the caller must have locked the @@ -2262,11 +2255,6 @@ retry: * Done with mapping lock. */ LWLockRelease(oldPartitionLock); - - /* - * Insert the buffer at the head of the list of free buffers. 
- */ - StrategyFreeBuffer(buf); } /* @@ -2684,11 +2672,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, { BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1); - /* - * The victim buffer we acquired previously is clean and unused, - * let it be found again quickly - */ - StrategyFreeBuffer(buf_hdr); UnpinBuffer(buf_hdr); } @@ -2763,12 +2746,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, valid = PinBuffer(existing_hdr, strategy); LWLockRelease(partition_lock); - - /* - * The victim buffer we acquired previously is clean and unused, - * let it be found again quickly - */ - StrategyFreeBuffer(victim_buf_hdr); UnpinBuffer(victim_buf_hdr); buffers[i] = BufferDescriptorGetBuffer(existing_hdr); @@ -3666,8 +3643,8 @@ BgBufferSync(WritebackContext *wb_context) uint32 new_recent_alloc; /* - * Find out where the freelist clock sweep currently is, and how many - * buffer allocations have happened since our last call. + * Find out where the clock sweep currently is, and how many buffer + * allocations have happened since our last call. */ strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc); diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 01909be0272..162c140fb9d 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -39,14 +39,6 @@ typedef struct */ pg_atomic_uint32 nextVictimBuffer; - int firstFreeBuffer; /* Head of list of unused buffers */ - int lastFreeBuffer; /* Tail of list of unused buffers */ - - /* - * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is, - * when the list is empty) - */ - /* * Statistics. These counters should be wide enough that they can't * overflow during a single bgwriter cycle. @@ -164,17 +156,16 @@ ClockSweepTick(void) } /* - * have_free_buffer -- a lockless check to see if there is a free buffer in - * buffer pool. 
+ * have_free_buffer -- check if we've filled the buffer pool at startup * - * If the result is true that will become stale once free buffers are moved out - * by other operations, so the caller who strictly want to use a free buffer - * should not call this. + * Used exclusively by autoprewarm. */ bool have_free_buffer(void) { - if (StrategyControl->firstFreeBuffer >= 0) + uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer); + + if (hand < NBuffers) return true; else return false; @@ -243,75 +234,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r } /* - * We count buffer allocation requests so that the bgwriter can estimate - * the rate of buffer consumption. Note that buffers recycled by a - * strategy object are intentionally not counted here. + * We keep an approximate count of buffer allocation requests so that the + * bgwriter can estimate the rate of buffer consumption. Note that + * buffers recycled by a strategy object are intentionally not counted + * here. */ pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1); - /* - * First check, without acquiring the lock, whether there's buffers in the - * freelist. Since we otherwise don't require the spinlock in every - * StrategyGetBuffer() invocation, it'd be sad to acquire it here - - * uselessly in most cases. That obviously leaves a race where a buffer is - * put on the freelist but we don't see the store yet - but that's pretty - * harmless, it'll just get used during the next buffer acquisition. - * - * If there's buffers on the freelist, acquire the spinlock to pop one - * buffer of the freelist. Then check whether that buffer is usable and - * repeat if not. - * - * Note that the freeNext fields are considered to be protected by the - * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to - * manipulate them without holding the spinlock. 
- */ - if (StrategyControl->firstFreeBuffer >= 0) - { - while (true) - { - /* Acquire the spinlock to remove element from the freelist */ - SpinLockAcquire(&StrategyControl->buffer_strategy_lock); - - if (StrategyControl->firstFreeBuffer < 0) - { - SpinLockRelease(&StrategyControl->buffer_strategy_lock); - break; - } - - buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer); - Assert(buf->freeNext != FREENEXT_NOT_IN_LIST); - - /* Unconditionally remove buffer from freelist */ - StrategyControl->firstFreeBuffer = buf->freeNext; - buf->freeNext = FREENEXT_NOT_IN_LIST; - - /* - * Release the lock so someone else can access the freelist while - * we check out this buffer. - */ - SpinLockRelease(&StrategyControl->buffer_strategy_lock); - - /* - * If the buffer is pinned or has a nonzero usage_count, we cannot - * use it; discard it and retry. (This can only happen if VACUUM - * put a valid buffer in the freelist and then someone else used - * it before we got to it. It's probably impossible altogether as - * of 8.3, but we'd better check anyway.) - */ - local_buf_state = LockBufHdr(buf); - if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0 - && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0) - { - if (strategy != NULL) - AddBufferToRing(strategy, buf); - *buf_state = local_buf_state; - return buf; - } - UnlockBufHdr(buf, local_buf_state); - } - } - - /* Nothing on the freelist, so run the "clock sweep" algorithm */ + /* Use the "clock sweep" algorithm to find a free buffer */ trycounter = NBuffers; for (;;) { @@ -356,29 +286,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r } } -/* - * StrategyFreeBuffer: put a buffer on the freelist - */ -void -StrategyFreeBuffer(BufferDesc *buf) -{ - SpinLockAcquire(&StrategyControl->buffer_strategy_lock); - - /* - * It is possible that we are told to put something in the freelist that - * is already in it; don't screw up the list if so. 
- */ - if (buf->freeNext == FREENEXT_NOT_IN_LIST) - { - buf->freeNext = StrategyControl->firstFreeBuffer; - if (buf->freeNext < 0) - StrategyControl->lastFreeBuffer = buf->buf_id; - StrategyControl->firstFreeBuffer = buf->buf_id; - } - - SpinLockRelease(&StrategyControl->buffer_strategy_lock); -} - /* * StrategySyncStart -- tell BgBufferSync where to start syncing * @@ -504,13 +411,6 @@ StrategyInitialize(bool init) SpinLockInit(&StrategyControl->buffer_strategy_lock); - /* - * Grab the whole linked list of free buffers for our strategy. We - * assume it was previously set up by BufferManagerShmemInit(). - */ - StrategyControl->firstFreeBuffer = 0; - StrategyControl->lastFreeBuffer = NBuffers - 1; - /* Initialize the clock sweep pointer */ pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0); diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index 52a71b138f7..d4449e11384 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -217,8 +217,7 @@ BufMappingPartitionLockByIndex(uint32 index) * single atomic variable. This layout allow us to do some operations in a * single atomic operation, without actually acquiring and releasing spinlock; * for instance, increase or decrease refcount. buf_id field never changes - * after initialization, so does not need locking. freeNext is protected by - * the buffer_strategy_lock not buffer header lock. The LWLock can take care + * after initialization, so does not need locking. The LWLock can take care * of itself. The buffer header lock is *not* used to control access to the * data in the buffer! 
* @@ -264,7 +263,6 @@ typedef struct BufferDesc pg_atomic_uint32 state; int wait_backend_pgprocno; /* backend of pin-count waiter */ - int freeNext; /* link in freelist chain */ PgAioWaitRef io_wref; /* set iff AIO is in progress */ LWLock content_lock; /* to lock access to buffer contents */ @@ -360,13 +358,6 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc) return (LWLock *) (&bdesc->content_lock); } -/* - * The freeNext field is either the index of the next freelist entry, - * or one of these special values: - */ -#define FREENEXT_END_OF_LIST (-1) -#define FREENEXT_NOT_IN_LIST (-2) - /* * Functions for acquiring/releasing a shared buffer header's spinlock. Do * not apply these to local buffers! -- 2.49.0
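The have_free_buffer() behavior described in the commit message above (true until the clock hand has visited every buffer once) can be illustrated with a small standalone C11 sketch. It assumes hypothetical names throughout: sketch_hand stands in for nextVictimBuffer, sketch_nbuffers for NBuffers, and sketch_prewarm() for a prewarm loop in which each loaded block advances the hand by one; none of these are PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for nextVictimBuffer and NBuffers. */
static _Atomic uint64_t sketch_hand;
static const uint64_t sketch_nbuffers = 4;

/* True until the hand has completed its first pass over the pool. */
static bool
sketch_have_free_buffer(void)
{
	return atomic_load(&sketch_hand) < sketch_nbuffers;
}

/*
 * Load up to nblocks_wanted blocks, stopping as soon as further loads
 * would start evicting buffers.  Assumes every load consumes exactly one
 * clock tick.  Returns how many blocks were loaded.
 */
static uint64_t
sketch_prewarm(uint64_t nblocks_wanted)
{
	uint64_t	loaded = 0;

	while (loaded < nblocks_wanted && sketch_have_free_buffer())
	{
		atomic_fetch_add(&sketch_hand, 1);
		loaded++;
	}
	return loaded;
}
```

With a 4-buffer pool and a request for 10 blocks, the loop stops after 4 loads, and a second call loads nothing because the first pass is already complete.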
From 167a36a6f38383a493cea88ba574a498e4b37dce Mon Sep 17 00:00:00 2001 From: Greg Burd <g...@burd.me> Date: Fri, 11 Jul 2025 09:05:45 -0400 Subject: [PATCH v6 2/2] Remove the buffer_strategy_lock and make the clock hand a 64 bit atomic Change nextVictimBuffer to an atomic uint64 and simply atomically increment it by 1 at each tick. The next victim buffer is the value of nextVictimBuffer modulo the number of buffers (NBuffers). Modulo can be expensive, so we implement that as if the value of NBuffers were required to be a power of 2 and account for the difference. The value of nextVictimBuffer, because it is only ever incremented, now encodes enough information to provide the number of completed passes of the clock-sweep algorithm as well. This eliminates the need for a separate counter and related maintenance. While wrap-around of nextVictimBuffer would require at least 200 years on today's hardware, should that happen BgBufferSync will properly determine the delta of passes. With the removal of the freelist and completePasses, none of the remaining items in the BufferStrategyControl structure require strict coordination, and so it is possible to eliminate the buffer_strategy_lock as well. --- src/backend/storage/buffer/README | 48 ++++--- src/backend/storage/buffer/bufmgr.c | 20 ++- src/backend/storage/buffer/freelist.c | 176 +++++++++++++------------- src/backend/storage/buffer/localbuf.c | 2 +- src/include/storage/buf_internals.h | 4 +- 5 files changed, 131 insertions(+), 119 deletions(-) diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README index cd52effd911..d1ab222eeb8 100644 --- a/src/backend/storage/buffer/README +++ b/src/backend/storage/buffer/README @@ -127,11 +127,10 @@ bits of the tag's hash value. The rules stated above apply to each partition independently. If it is necessary to lock more than one partition at a time, they must be locked in partition-number order to avoid risk of deadlock. 
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual -exclusion for operations that select buffers for replacement. A spinlock is -used here rather than a lightweight lock for efficiency; no other locks of any -sort should be acquired while buffer_strategy_lock is held. This is essential -to allow buffer replacement to happen in multiple backends with reasonable +* Operations that select buffers for replacement don't require a lock, but +rather use atomic operations to ensure coordination across backends when +accessing members of the BufferStrategyControl datastructure. This allows +buffer replacement to happen in multiple backends with reasonable concurrency. * Each buffer header contains a spinlock that must be taken when examining @@ -158,9 +157,9 @@ unset by sleeping on the buffer's condition variable. Normal Buffer Replacement Strategy ---------------------------------- -To choose a victim buffer to recycle when there are no free buffers available, -we use a simple clock-sweep algorithm, which avoids the need to take -system-wide locks during common operations. It works like this: +To choose a victim buffer to recycle we use a simple clock-sweep algorithm, +which avoids the need to take system-wide locks during common operations. It +works like this: Each buffer header contains a usage counter, which is incremented (up to a small limit value) whenever the buffer is pinned. (This requires only the @@ -168,19 +167,17 @@ buffer header spinlock, which would have to be taken anyway to increment the buffer reference count, so it's nearly free.) The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly -through all the available buffers. nextVictimBuffer is protected by the -buffer_strategy_lock. +through all the available buffers. nextVictimBuffer and completePasses are +atomic values. The algorithm for a process that needs to obtain a victim buffer is: -1. Obtain buffer_strategy_lock. +1. 
Select the buffer pointed to by nextVictimBuffer, and circularly advance +nextVictimBuffer for next time. -2. Select the buffer pointed to by nextVictimBuffer, and circularly advance -nextVictimBuffer for next time. Release buffer_strategy_lock. - -3. If the selected buffer is pinned or has a nonzero usage count, it cannot -be used. Decrement its usage count (if nonzero), reacquire -buffer_strategy_lock, and return to step 3 to examine the next buffer. +2. If the selected buffer is pinned or has a nonzero usage count, it cannot be +used. Decrement its usage count (if nonzero), return to step 3 to examine the +next buffer. 4. Pin the selected buffer, and return. @@ -196,9 +193,9 @@ Buffer Ring Replacement Strategy When running a query that needs to access a large number of pages just once, such as VACUUM or a large sequential scan, a different strategy is used. A page that has been touched only by such a scan is unlikely to be needed -again soon, so instead of running the normal clock sweep algorithm and +again soon, so instead of running the normal clock-sweep algorithm and blowing out the entire buffer cache, a small ring of buffers is allocated -using the normal clock sweep algorithm and those buffers are reused for the +using the normal clock-sweep algorithm and those buffers are reused for the whole scan. This also implies that much of the write traffic caused by such a statement will be done by the backend itself and not pushed off onto other processes. @@ -244,13 +241,12 @@ nextVictimBuffer (which it does not change!), looking for buffers that are dirty and not pinned nor marked with a positive usage count. It pins, writes, and releases any such buffer. -If we can assume that reading nextVictimBuffer is an atomic action, then -the writer doesn't even need to take buffer_strategy_lock in order to look -for buffers to write; it needs only to spinlock each buffer header for long -enough to check the dirtybit. 
Even without that assumption, the writer -only needs to take the lock long enough to read the variable value, not -while scanning the buffers. (This is a very substantial improvement in -the contention cost of the writer compared to PG 8.0.) +We enforce reading nextVictimBuffer within an atomic action so it needs only to +spinlock each buffer header for long enough to check the dirtybit. Even +without that assumption, the writer only needs to take the lock long enough to +read the variable value, not while scanning the buffers. (This is a very +substantial improvement in the contention cost of the writer compared to PG +8.0.) The background writer takes shared content lock on a buffer while writing it out (and anyone else who flushes buffer contents to disk must do so too). diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index af5ef025229..0be6f4d8c80 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3593,7 +3593,7 @@ BufferSync(int flags) * This is called periodically by the background writer process. * * Returns true if it's appropriate for the bgwriter process to go into - * low-power hibernation mode. (This happens if the strategy clock sweep + * low-power hibernation mode. (This happens if the strategy clock-sweep * has been "lapped" and no buffer allocations have occurred recently, * or if the bgwriter has been effectively disabled by setting * bgwriter_lru_maxpages to 0.) @@ -3643,7 +3643,7 @@ BgBufferSync(WritebackContext *wb_context) uint32 new_recent_alloc; /* - * Find out where the clock sweep currently is, and how many buffer + * Find out where the clock-sweep currently is, and how many buffer * allocations have happened since our last call. 
*/ strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc); @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context) /* * Compute strategy_delta = how many buffers have been scanned by the - * clock sweep since last time. If first time through, assume none. Then - * see if we are still ahead of the clock sweep, and if so, how many + * clock-sweep since last time. If first time through, assume none. Then + * see if we are still ahead of the clock-sweep, and if so, how many * buffers we could scan before we'd catch up with it and "lap" it. Note: * weird-looking coding of xxx_passes comparisons are to avoid bogus * behavior when the passes counts wrap around. */ if (saved_info_valid) { - int32 passes_delta = strategy_passes - prev_strategy_passes; + int32 passes_delta; + + if (unlikely(prev_strategy_passes > strategy_passes)) + { + /* wrap-around case; the + 1 counts the step from UINT32_MAX to 0 */ + passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes + 1); + } + else + { + passes_delta = (int32) (strategy_passes - prev_strategy_passes); + } strategy_delta = strategy_buf_id - prev_strategy_buf_id; strategy_delta += (long) passes_delta * NBuffers; diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 162c140fb9d..0b49d178362 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -15,6 +15,7 @@ */ #include "postgres.h" +#include <math.h> #include "pgstat.h" #include "port/atomics.h" #include "storage/buf_internals.h" @@ -29,21 +30,17 @@ */ typedef struct { - /* Spinlock: protects the values below */ - slock_t buffer_strategy_lock; - /* - * Clock sweep hand: index of next buffer to consider grabbing. Note that - * this isn't a concrete buffer - we only ever increase the value. So, to - * get an actual buffer, it needs to be used modulo NBuffers. + * This is used as both the clock-sweep hand and the number of complete + * passes through the buffer pool.
The lower bits below NBuffers are the + * clock-sweep and the upper bits are the number of complete passes. */ - pg_atomic_uint32 nextVictimBuffer; + pg_atomic_uint64 nextVictimBuffer; /* * Statistics. These counters should be wide enough that they can't * overflow during a single bgwriter cycle. */ - uint32 completePasses; /* Complete cycles of the clock sweep */ pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */ /* @@ -83,12 +80,71 @@ typedef struct BufferAccessStrategyData Buffer buffers[FLEXIBLE_ARRAY_MEMBER]; } BufferAccessStrategyData; +static uint32 NBuffersPow2Mask; /* Next power-of-2 >= NBuffers - 1 */ +static uint32 NBuffersPow2Shift; /* Amount to bitshift for division */ +static uint32 NBuffersPerCycle; /* Number of buffers in a complete cycle */ /* Prototypes for internal functions */ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state); static void AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf); +static inline uint32 clock_passes(uint64 hand); +static inline uint32 clock_read(uint64 hand); + + /* + * Calculate the number of complete passes through the buffer pool that have + * happened thus far. A "pass" is defined as the clock hand moving through + * all the buffers (NBuffers) in the pool once. Our clock hand is a 64-bit + * counter that only increases. The number of passes is the upper bits of the + * counter divided by NBuffers. + */ +static inline uint32 +clock_passes(uint64 hand) +{ + uint32 result; + + /* Calculate complete next power-of-2 cycles by bitshifting */ + uint64 pow2_passes = hand >> NBuffersPow2Shift; + + /* Determine the hand's current position in the cycle */ + uint64 masked_hand = hand & NBuffersPow2Mask; + + /* Has the hand passed NBuffers yet? */ + uint32 extra_passes = (masked_hand >= NBuffers) ? 1 : 0; + + /* + * Combine total passes, multiply complete power-of-2 cycles by passes + * per-cycle, then add any extra pass from the current incomplete cycle. 
+ */ + result = (uint32) (pow2_passes * NBuffersPerCycle) + extra_passes; + + Assert(result <= UINT32_MAX); + Assert(result == ((uint32) (hand / NBuffers))); + + return result; +} + + /* + * The hand's value is a 64-bit counter that only increases, so its position + * is determined by the lower bits of the counter modulo by NBuffers. To + * avoid the modulo operation we use the next power-of-2 mask and adjust for + * the difference. + */ +static inline uint32 +clock_read(uint64 hand) +{ + /* Determine the hand's current position in the cycle */ + uint64 result = (uint32) hand & NBuffersPow2Mask; + + /* Adjust if the next power of 2 masked counter is more than NBuffers */ + if (result >= NBuffers) + result -= NBuffers; + + Assert(result == (uint32) (hand % NBuffers)); + + return result; +} /* * ClockSweepTick - Helper routine for StrategyGetBuffer() @@ -99,6 +155,7 @@ static void AddBufferToRing(BufferAccessStrategy strategy, static inline uint32 ClockSweepTick(void) { + uint64 hand; uint32 victim; /* @@ -106,52 +163,11 @@ ClockSweepTick(void) * doing this, this can lead to buffers being returned slightly out of * apparent order. */ - victim = - pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1); + hand = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1); + victim = clock_read(hand); - if (victim >= NBuffers) - { - uint32 originalVictim = victim; - - /* always wrap what we look up in BufferDescriptors */ - victim = victim % NBuffers; + Assert(victim < NBuffers); - /* - * If we're the one that just caused a wraparound, force - * completePasses to be incremented while holding the spinlock. We - * need the spinlock so StrategySyncStart() can return a consistent - * value consisting of nextVictimBuffer and completePasses. - */ - if (victim == 0) - { - uint32 expected; - uint32 wrapped; - bool success = false; - - expected = originalVictim + 1; - - while (!success) - { - /* - * Acquire the spinlock while increasing completePasses. 
That - * allows other readers to read nextVictimBuffer and - * completePasses in a consistent manner which is required for - * StrategySyncStart(). In theory delaying the increment - * could lead to an overflow of nextVictimBuffers, but that's - * highly unlikely and wouldn't be particularly harmful. - */ - SpinLockAcquire(&StrategyControl->buffer_strategy_lock); - - wrapped = expected % NBuffers; - - success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer, - &expected, wrapped); - if (success) - StrategyControl->completePasses++; - SpinLockRelease(&StrategyControl->buffer_strategy_lock); - } - } - } return victim; } @@ -193,10 +209,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r *from_ring = false; - /* - * If given a strategy object, see whether it can select a buffer. We - * assume strategy objects don't need buffer_strategy_lock. - */ + /* If given a strategy object, see whether it can select a buffer */ if (strategy != NULL) { buf = GetBufferFromRing(strategy, buf_state); @@ -241,7 +254,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r */ pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1); - /* Use the "clock sweep" algorithm to find a free buffer */ + /* Use the "clock-sweep" algorithm to find a free buffer */ trycounter = NBuffers; for (;;) { @@ -297,32 +310,25 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r * allocs if non-NULL pointers are passed. The alloc count is reset after * being read. 
*/ -int +uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) { - uint32 nextVictimBuffer; - int result; + uint64 counter; + uint32 result; - SpinLockAcquire(&StrategyControl->buffer_strategy_lock); - nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer); - result = nextVictimBuffer % NBuffers; + counter = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer); + result = clock_read(counter); if (complete_passes) { - *complete_passes = StrategyControl->completePasses; - - /* - * Additionally add the number of wraparounds that happened before - * completePasses could be incremented. C.f. ClockSweepTick(). - */ - *complete_passes += nextVictimBuffer / NBuffers; + *complete_passes = clock_passes(counter); } if (num_buf_alloc) { *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0); } - SpinLockRelease(&StrategyControl->buffer_strategy_lock); + return result; } @@ -337,21 +343,14 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc) void StrategyNotifyBgWriter(int bgwprocno) { - /* - * We acquire buffer_strategy_lock just to ensure that the store appears - * atomic to StrategyGetBuffer. The bgwriter should call this rather - * infrequently, so there's no performance penalty from being safe. - */ - SpinLockAcquire(&StrategyControl->buffer_strategy_lock); StrategyControl->bgwprocno = bgwprocno; - SpinLockRelease(&StrategyControl->buffer_strategy_lock); } /* * StrategyShmemSize * - * estimate the size of shared memory used by the freelist-related structures. + * Estimate the size of shared memory used by the freelist-related structures. * * Note: for somewhat historical reasons, the buffer lookup hashtable size * is also determined here. 
@@ -404,18 +403,25 @@ StrategyInitialize(bool init) if (!found) { + uint32 NBuffersPow2; + /* * Only done once, usually in postmaster */ Assert(init); - SpinLockInit(&StrategyControl->buffer_strategy_lock); - - /* Initialize the clock sweep pointer */ - pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0); + /* Initialize combined clock-sweep pointer/complete passes counter */ + pg_atomic_init_u64(&StrategyControl->nextVictimBuffer, 0); + /* Find the smallest power of 2 larger than NBuffers */ + NBuffersPow2 = pg_nextpower2_32(NBuffers); + /* Using that, find the number of positions to shift for division */ + NBuffersPow2Shift = pg_leftmost_one_pos32(NBuffersPow2); + /* Calculate passes per power-of-2, typically 1 or 2 */ + NBuffersPerCycle = NBuffersPow2 / NBuffers; + /* The bitmask to extract the lower portion of the clock */ + NBuffersPow2Mask = NBuffersPow2 - 1; /* Clear statistics */ - StrategyControl->completePasses = 0; pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0); /* No pending notification */ @@ -659,7 +665,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state) * * If usage_count is 0 or 1 then the buffer is fair game (we expect 1, * since our own previous usage of the ring element would have left it - * there, but it might've been decremented by clock sweep since then). A + * there, but it might've been decremented by clock-sweep since then). A * higher usage_count indicates someone else has touched the buffer, so we * shouldn't re-use it. */ diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 3da9c41ee1d..7a34f5e430a 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -229,7 +229,7 @@ GetLocalVictimBuffer(void) ResourceOwnerEnlarge(CurrentResourceOwner); /* - * Need to get a new buffer. We use a clock sweep algorithm (essentially + * Need to get a new buffer. 
We use a clock-sweep algorithm (essentially * the same as what freelist.c does now...) */ trycounter = NLocBuffer; diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index d4449e11384..f2283ea8e22 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -81,7 +81,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32, * accuracy and speed of the clock-sweep buffer management algorithm. A * large value (comparable to NBuffers) would approximate LRU semantics. * But it can take as many as BM_MAX_USAGE_COUNT+1 complete cycles of - * clock sweeps to find a free buffer, so in practice we don't want the + * clock-sweeps to find a free buffer, so in practice we don't want the * value to be very large. */ #define BM_MAX_USAGE_COUNT 5 @@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf); extern bool StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring); -extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc); +extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc); extern void StrategyNotifyBgWriter(int bgwprocno); extern Size StrategyShmemSize(void); -- 2.49.0
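For anyone who wants to check the arithmetic outside the server, here is a minimal standalone sketch of the two values the Asserts in clock_read() and clock_passes() compare against: the hand's position (hand % NBuffers) and the number of complete passes (hand / NBuffers). It deliberately uses plain division and modulo rather than the power-of-2 mask and shift from the patch. The helper names hand_position(), hand_passes() and passes_delta() are hypothetical and exist only for this illustration; the last one shows the pass delta computed by plain unsigned subtraction, as the pre-patch code in BgBufferSync() did.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: position of the clock hand
 * within the pool, always in [0, nbuffers).  This is the reference value
 * that clock_read() asserts against.
 */
static inline uint32_t
hand_position(uint64_t hand, uint32_t nbuffers)
{
	return (uint32_t) (hand % nbuffers);
}

/*
 * Hypothetical helper, not part of the patch: number of complete passes,
 * truncated to 32 bits.  This is the reference value that clock_passes()
 * asserts against.
 */
static inline uint32_t
hand_passes(uint64_t hand, uint32_t nbuffers)
{
	return (uint32_t) (hand / nbuffers);
}

/*
 * Hypothetical helper, not part of the patch: passes elapsed between two
 * samples.  Unsigned subtraction is modulo 2^32, so a counter that wrapped
 * from UINT32_MAX to 0 still yields a delta of 1 without a special case.
 */
static inline int32_t
passes_delta(uint32_t prev_passes, uint32_t cur_passes)
{
	return (int32_t) (cur_passes - prev_passes);
}
```

A pool size that is not a power of 2 (for example 3) makes a useful test input for any mask-based replacement, since that is where masking and true modulo can disagree.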