A few days ago, during a conversation about another patch, Andres and
I realized that my patch to set the VM on-access could lead to
out-of-date freespace maps. Pages set all-visible in the VM can be
skipped by vacuum, which would then not update the FSM for that page.

The solution to this is to update the FSM in on-access pruning if we
updated the VM. There is, of course, the possibility of additional
overhead here. The needed FSM page will often already be cached, but
there is still overhead to pinning it and, if the recorded free space
category has changed, to dirtying it.

Dirtying it isn't a big concern because RecordPageWithFreeSpace() only
dirties the FSM when the freespace category actually changes -- in
which case it's worth it. And it's not WAL-logged, so the write itself
is cheap.

I concocted the worst-case scenario I could come up with -- a relation
where every single page needs the VM set and an FSM update, and the
query returns no rows and deforms no tuples. In this case, there was a
slowdown of a few percentage points due to the extra buffer pinning
and lock acquire/release. (Repro at end of email)

I tried to think of some heuristics so that we could limit when we did
the FSM pinning and locking, but none seemed very good. We could check
if the new amount of free space is bigger than an FSM category step
(32 bytes with the default block size), but that doesn't help us if we
are correcting an FSM overestimation. This could happen because
inserts don't update the FSM until the tuple being inserted doesn't
fit on the target page.
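
To make the overestimation case concrete, here is a minimal sketch of
the category arithmetic. It assumes the default 8kB block size and 256
FSM categories (hence the 32-byte step). All sketch_* names are
hypothetical helpers for illustration, not functions from the tree, and
the mapping is simplified: it ignores the special handling of the
highest category.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed defaults: 8kB blocks, 256 categories, so one step is 32 bytes. */
#define SKETCH_BLCKSZ			8192
#define SKETCH_FSM_CATEGORIES	256
#define SKETCH_FSM_CAT_STEP		(SKETCH_BLCKSZ / SKETCH_FSM_CATEGORIES)

/* Simplified mapping from free bytes on a page to an FSM category. */
static uint8_t
sketch_space_to_category(size_t avail)
{
	assert(avail < SKETCH_BLCKSZ);
	return (uint8_t) (avail / SKETCH_FSM_CAT_STEP);
}

/* The rejected heuristic: only touch the FSM if more than one step is free. */
static bool
sketch_heuristic_would_update(size_t new_avail)
{
	return new_avail > SKETCH_FSM_CAT_STEP;
}

/* What actually matters: does the recorded category differ from reality? */
static bool
sketch_fsm_is_stale(uint8_t recorded_cat, size_t new_avail)
{
	return recorded_cat != sketch_space_to_category(new_avail);
}
```

With a stale recorded category of 100 (roughly 3200 bytes) and only 20
bytes actually free, sketch_heuristic_would_update(20) is false while
sketch_fsm_is_stale(100, 20) is true, so the heuristic would skip
exactly the update that corrects the overestimate.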

I also thought of caching the pinned FSM page in the scan descriptor
like we do with the VM page. This doesn't work as nicely because each
FSM page covers fewer heap pages than a VM page does. Also, the
pinning and unpinning all happen inside of the FSM API functions.
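
For a rough sense of the difference in coverage, here is a
back-of-the-envelope sketch. It assumes 8kB blocks, a 24-byte aligned
page header, two VM bits per heap page, and one FSM byte per node with
about half of each FSM page used for non-leaf nodes. These constants
are written from memory of the on-disk layout, so treat the exact
results as approximate. The sketch_* names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_BLCKSZ		8192
#define SKETCH_PAGE_HEADER	24	/* assumed aligned page header size */
#define SKETCH_FSM_PAGE_META	4	/* assumed per-page FSM bookkeeping */

/* VM: two bits per heap page, i.e. four heap pages per byte. */
static size_t
sketch_heap_pages_per_vm_page(void)
{
	return (SKETCH_BLCKSZ - SKETCH_PAGE_HEADER) * 4;
}

/*
 * FSM: one byte per node. Only leaf nodes of the in-page binary tree map
 * to heap pages; the non-leaf nodes take up roughly half the page.
 */
static size_t
sketch_heap_pages_per_fsm_leaf_page(void)
{
	size_t		nodes = SKETCH_BLCKSZ - SKETCH_PAGE_HEADER - SKETCH_FSM_PAGE_META;
	size_t		non_leaf = SKETCH_BLCKSZ / 2 - 1;

	return nodes - non_leaf;
}
```

Under these assumptions a VM page covers about eight times as many
heap pages as an FSM leaf page, so a cached FSM pin would need to be
swapped far more often during a sequential scan.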

Therefore, I think the best option is the simplest -- if we set the
page all-visible, also see if we should update the FSM.

Note that this fix is only needed on the primary -- when on-access
pruning set the page all-visible and emitted a WAL record for it, the
standby was already updating the FSM while replaying the prune record
(in heap_xlog_prune_freeze() -> XLogRecordPageWithFreeSpace()).

And, finally, we do not have to worry about vacuuming the FSM
on-access because vacuum will still call FreeSpaceMapVacuum() over
ranges of heap pages -- including the pages it skipped.

Proposed patch attached. This requires a backpatch to 19.

Repro:
CREATE TABLE t (id int, pad text) WITH (autovacuum_enabled=off);
-- 170000 rows of 420 bytes each is ~10,000 pages
INSERT INTO t SELECT g, repeat('x',420) FROM generate_series(1,170000);
VACUUM (FREEZE) t;
-- create one removable dead tuple per page by shrinking the first row per page
UPDATE t SET pad = repeat('y',5)
  WHERE id IN (SELECT min(id) FROM t GROUP BY (ctid::text::point)[0]);
-- advance the xmin horizon (so previously created dead tuples are
-- removable on-access)
create table dummy (a int);
-- on-access prune during scan sets every page all-visible and updates FSM
SELECT 1 FROM t OFFSET 10000000;
-- this will show a different number before and after my patch
SELECT sum(avail) FROM pg_freespace('t');

- Melanie
From 5b170ea2c43b4ab828e67afc0bc875b1d593ecd6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 29 Jun 2026 18:24:01 -0400
Subject: [PATCH v1] Update FSM after updating VM on-access

b46e1e54d078de allowed setting the VM while on-access pruning, but it
neglected to update the freespace map. Once the page was all-visible,
vacuum could skip it, leading to stale freespace map values and,
effectively, bloat. Fix it by updating the FSM if we updated the VM.
---
 src/backend/access/heap/pruneheap.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index fdddd23035b..bcffec3055e 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "storage/freespace.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
@@ -321,6 +322,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer,
 
 	if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 	{
+		bool		record_free_space = false;
+		Size		freespace = 0;
+
 		/* OK, try to get exclusive buffer lock */
 		if (!ConditionalLockBufferForCleanup(buffer))
 			return;
@@ -376,16 +380,32 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer,
 			if (presult.ndeleted > presult.nnewlpdead)
 				pgstat_update_heap_dead_tuples(relation,
 											   presult.ndeleted - presult.nnewlpdead);
+
+			/*
+			 * If this prune newly set the page all-visible, VACUUM may later
+			 * skip the page and thus not update its free space map (FSM)
+			 * entry. Keep the FSM from going stale by recording it now. We do
+			 * not want to update the freespace map otherwise (to reserve
+			 * freespace on this page for future updates).
+			 */
+			if (presult.newly_all_visible)
+			{
+				record_free_space = true;
+				freespace = PageGetHeapFreeSpace(page);
+			}
 		}
 
 		/* And release buffer lock */
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 		/*
-		 * We avoid reuse of any free space created on the page by unrelated
-		 * UPDATEs/INSERTs by opting to not update the FSM at this point.  The
-		 * free space should be reused by UPDATEs to *this* page.
+		 * RecordPageWithFreeSpace() only dirties the FSM when the recorded
+		 * free-space category actually changes. Note that vacuum will still
+		 * do FreeSpaceMapVacuum() for ranges of pages that are skipped, so we
+		 * don't have to worry about that here.
 		 */
+		if (record_free_space)
+			RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);
 	}
 }
 
-- 
2.43.0
