On Wed, Sep 10, 2025 at 6:14 AM Nazir Bilal Yavuz <[email protected]> wrote:
>

Thanks so much for the review! I've only left inline comments on the
points that might still need discussion. Otherwise, I've incorporated
your suggested changes.

> From 2c8aafe30fb58516654e7d0cfdbfbb15a6a00498 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Tue, 2 Sep 2025 11:32:24 -0400
> Subject: [PATCH v3 2/4] Split FlushBuffer() into two parts
> 2-
> +/*
> + * Prepare to write and write a dirty victim buffer.
>
> Although this comment is correct, it is a bit complicated for me. How
> about 'Prepare to write and then write a dirty victim buffer'?

I've gone with " * Prepare and write out a dirty victim buffer."

> From 32f8dbe2c885ce45ef7b217c2693333525fb9b89 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Tue, 2 Sep 2025 12:43:24 -0400
> Subject: [PATCH v3 3/4] Eagerly flush bulkwrite strategy ring
>
> + * consider writing out.
> + */
> +static BufferDesc *
> +next_strat_buf_to_flush(BufferAccessStrategy strategy,
> +                        XLogRecPtr *lsn)
> +{
> +    Buffer        bufnum;
> +    BufferDesc *bufdesc;
> +
> +    while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
> +    {
>
> StrategySweepNextBuffer() returns InvalidBuffer when we reach the
> start but can strategy->buffers[strategy->sweep_current] be an
> InvalidBuffer? I mean, is the following case possible:
> strategy->buffers[strategy->sweep_current] is an InvalidBuffer but
> strategy->buffers[strategy->sweep_current + 1] is not. So, we exit
> early from the next_strat_buf_to_flush() although there are more
> buffers to consider writing out.

Yes, good thought. Actually, for BAS_BULKWRITE this cannot happen:
when a ring buffer is not reused, we overwrite its slot in the buffers
array with the shared buffer we replace it with. It can happen for
BAS_BULKREAD, but since we are only concerned with writing, I think we
can terminate once we hit an InvalidBuffer in the ring.

While looking at this, I decided it didn't make sense to keep sweep
state in the strategy object, so I've changed the way
StrategySweepNextBuffer() works. There was also a bug in the sweep --
it could run into and past the starting buffer -- which the new version
fixes. Take a look at the new approach (rough sketch below) and let me
know what you think.
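
To make that concrete, here is a rough sketch of how the caller-driven
sweep is used in the attached v4-0003 (simplified; pinning, locking, and
error handling are elided):

    /* the cursor lives on the caller's stack, not in the strategy */
    int         cursor = StrategySweepStart(strategy);
    Buffer      sweep_end = BufferDescriptorGetBuffer(bufdesc);
    Buffer      bufnum;

    while ((bufnum = StrategySweepNextBuffer(strategy, &cursor)) != sweep_end)
    {
        /* for BAS_BULKWRITE, the rest of the ring is invalid too */
        if (!BufferIsValid(bufnum))
            break;

        /* ... consider bufnum for eager flushing ... */
    }

The sweep now ends when it wraps back around to the victim buffer we
started from, so it can no longer run past its starting point.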

> +/*
> + * Start a sweep of the strategy ring.
> + */
> +void
> +StartStrategySweep(BufferAccessStrategy strategy)
> +{
> +    if (!strategy)
> +        return;
>
> I think we will always use this function together with
> strategy_supports_eager_flush(), right? If yes, then we do not need to
> check if the strategy is NULL. If not, then I think this function
> should return boolean to make it explicit that we can not do sweep.

Yes, I just removed this check.

> +extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
>
> All the functions in the buf_internals.h are pascal case, should we
> make this too?

I thought I'd go a different way because it's informational rather than
a function that does something -- but you're right. I've given in and
made all my helpers pascal case.

- Melanie
From dbcc430c4b92c2a69f84fe9ab3faa94f61eb3d99 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v4 3/9] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs: client backends achieve higher write throughput and spend less
time waiting to acquire the lock to flush WAL. Larger writes also mean
less time spent waiting on flushes at the kernel level.

The heuristic for eager flushing is to flush only those buffers in the
strategy ring that do not require a WAL flush.

This patch is also a step toward AIO writes.

Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Earlier version Reviewed-by: Kirill Reshke <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 189 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  48 +++++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 235 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f40f57e5582..c64268f31ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -534,7 +534,16 @@ static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object
 						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+static BufferDesc *NextStratBufToFlush(BufferAccessStrategy strategy,
+									   Buffer sweep_end,
+									   XLogRecPtr *lsn,
+									   int *sweep_cursor);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator,
+												   bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc, uint32 *buf_state,
 							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
@@ -2420,7 +2429,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 			}
 
 			/* Content lock is released inside CleanVictimBuffer */
-			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4254,6 +4263,40 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to consider
+ * writing out.
+ */
+static BufferDesc *
+NextStratBufToFlush(BufferAccessStrategy strategy,
+					Buffer sweep_end,
+					XLogRecPtr *lsn, int *sweep_cursor)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum =
+			StrategySweepNextBuffer(strategy, sweep_cursor)) != sweep_end)
+	{
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare and write out a dirty victim buffer.
  *
@@ -4264,12 +4307,14 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * bufdesc and buf_state may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc, uint32 *buf_state,
 				  bool from_ring, IOContext io_context)
 {
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	Assert(*buf_state & BM_DIRTY);
 
@@ -4277,11 +4322,143 @@ CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
 	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && StrategySupportsEagerFlush(strategy))
+	{
+		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
+		int			cursor = StrategySweepStart(strategy);
+
+		/* Clean victim buffer and find more to flush opportunistically */
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = NextStratBufToFlush(strategy, sweep_end,
+												&max_lsn, &cursor)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare the buffer identified by bufnum for eager flushing.
+ *
+ * Returns its buffer descriptor -- pinned, share-locked, and with I/O started
+ * -- if it contains a block we will opportunistically flush, or NULL if it
+ * does not contain a block that should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/* Must do this before taking the buffer header spinlock */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufdesc);
+
+	if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+		goto except_unlock_header;
+
+	/* We don't eagerly flush buffers used by others */
+	if (skip_pinned &&
+		(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+		 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+		goto except_unlock_header;
+
+	/* Get page LSN while holding header lock */
+	lsn = BufferGetLSN(bufdesc);
+
+	PinBuffer_Locked(bufdesc);
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* If we'll have to flush WAL to flush the block, we're done */
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unpin_buffer;
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	buf_state = LockBufHdr(bufdesc);
+	lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc, buf_state);
+
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
+
+except_unlock_header:
+	UnlockBufHdr(bufdesc, buf_state);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 12bb7e2312e..8716109221b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -156,6 +156,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -270,6 +295,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Advance the caller's sweep cursor and return the buffer in that slot of the
+ * ring. Callers are responsible for detecting the end of the sweep.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+	if (++(*sweep_cursor) >= strategy->nbuffers)
+		*sweep_cursor = 0;
+
+	return strategy->buffers[*sweep_cursor];
+}
+
+/*
+ * Return the starting cursor position for a sweep of the strategy ring.
+ */
+int
+StrategySweepStart(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b1b81f31419..03faf80e441 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -437,6 +437,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy,
+									  int *sweep_cursor);
+extern int	StrategySweepStart(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0

From c782753a430c1c967125509c6390d4e710fd2a63 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v4 2/9] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are moved into a new FlushBuffer() helper function,
CleanVictimBuffer(), which will contain both the batch flushing and
single-flush code in future commits.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 141 +++++++++++++++++++---------
 1 file changed, 98 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f3668051574..f40f57e5582 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,13 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2414,12 +2419,8 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			/* Content lock is released inside CleanVictimBuffer */
+			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4246,53 +4247,66 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
+	XLogRecPtr	lsn;
 
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (PrepareFlushBuffer(buf, &buf_state, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
 
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc and buf_state may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+				  bool from_ring, IOContext io_context)
+{
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
+	Assert(*buf_state & BM_DIRTY);
 
-	buf_state = LockBufHdr(buf);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
+		return;
 
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer described by bufdesc for writing. buf_state and lsn are
+ * output parameters. Returns true if the buffer actually needs writing and
+ * false otherwise. All three parameters may be modified.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn)
+{
 	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
+	*lsn = InvalidXLogRecPtr;
+	*buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4305,9 +4319,50 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have the buffer exclusively locked in all cases.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (*buf_state & BM_PERMANENT)
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	*buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, *buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

From 6c46b33c7a51990f1d2df0fab7dfea2f88e0861e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v4 1/9] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with
standard for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  32 ++++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 124 insertions(+), 113 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..f3668051574 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2344,130 +2340,122 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned with its header
-	 * spinlock still held!
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
-
-	/* Pin the buffer and then release the buffer spinlock */
-	PinBuffer_Locked(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
+		/*
+		 * Attempt to claim a victim buffer.  The buffer is returned with its
+		 * header spinlock still held!
+		 */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
 
-	if (buf_state & BM_VALID)
-	{
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer. Then loop around and try again.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..12bb7e2312e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -716,12 +717,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must hold the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
- * if this buffer should be written and re-used.
+ * if this buffer should be written and reused.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -731,11 +741,19 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
-	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
-	 * loop if all ring members are dirty.
-	 */
-	strategy->buffers[strategy->current] = InvalidBuffer;
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (XLogNeedsFlush(lsn))
+	{
+		/*
+		 * Remove the dirty buffer from the ring; necessary to prevent an
+		 * infinite loop if all ring members are dirty.
+		 */
+		strategy->buffers[strategy->current] = InvalidBuffer;
+		return true;
+	}
 
-	return true;
+	return false;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index dfd614f7ca4..b1b81f31419 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -419,6 +419,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0
