On 07/11/2025 17:33, Heikki Linnakangas wrote:
Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0. We should take the "there are NO MultiXacts" codepath in that case, because we skip over 0 when assigning multixids. Instead, we call find_multixact_start with oldestMulti==1, which returns false because multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet. There's a similar bug in SetOffsetVacuumLimit().

Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get the log message. But I think there might be more serious variants of this. If the SLRU segment exists but the offset for multixid 1 hasn't been set yet, find_multixact_start() will return 0 instead, and we will proceed with the truncation based on incorrect oldestOffset==0 value, possibly removing SLRU segments that are still needed.

Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.

These scenarios were incidentally fixed by commit 789d65364c "Set next multixid's offset when creating a new multixid". (On master, commit 87a350e1f2 "Never store 0 as the nextMXact" would also have fixed it, but 789d65364c was committed first.)

Scenario 3: I also noticed that the above code isn't prepared for the race condition that the offset corresponding to 'oldestMulti' hasn't been stored in the SLRU yet, even without wraparound. That could theoretically happen if the backend executing MultiXactIdCreateFromMembers() gets stuck for a long time between the calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're saved by the fact that we only create new multixids while holding a lock on a heap page, and a system-wide VACUUM FREEZE that would advance oldestMulti would need to lock the heap page too. It's scary though, because it could also lead to truncating away members SLRU segments that are still needed. The attached patch does *not* address this scenario.

I believe this is still a potential issue, although I'm still not sure if it's possible to hit in practice. I see two avenues:

1. A long delay between GetNewMultiXactId() and RecordNewMultiXact(), as I suggested above.

2. A crash between GetNewMultiXactId() and RecordNewMultiXact(), which leaves behind an multixid with 0 offset.

I'm not sure if TruncateMultiXact() can be called with such an invalid multixid. It cannot appear anywhere in the heap, so perhaps VACUUM will never pick it as the oldest multixid. But I can't rule it out either. In any case, we should put a safeguard in TruncateMultiXact() for this. The consequences are bad, and it's easy to check for.


While working on this, I noticed that the truncation code does a couple of unnecessary things:

        /*
         * Note we can't just plow ahead with the truncation; it's possible that
         * there are no segments to truncate, which is a problem because we are
         * going to attempt to read the offsets page to determine where to
         * truncate the members SLRU.  So we first scan the directory to 
determine
         * the earliest offsets page number that we can read without error.
         *
         * When nextMXact is less than one segment away from multiWrapLimit,
         * SlruScanDirCbFindEarliest can find some early segment other than the
         * actual earliest.  (MultiXactOffsetPagePrecedes(EARLIEST, LATEST)
         * returns false, because not all pairs of entries have the same 
answer.)
         * That can also arise when an earlier truncation attempt failed 
unlink()
         * or returned early from this function.  The only consequence is
         * returning early, which wastes space that we could have liberated.
         *
         * NB: It's also possible that the page that oldestMulti is on has 
already
         * been truncated away, and we crashed before updating oldestMulti.
         */
        trunc.earliestExistingPage = -1;
        SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, 
&trunc);
        earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
        if (earliest < FirstMultiXactId)
                earliest = FirstMultiXactId;

        /* If there's nothing to remove, we can bail out early. */
        if (MultiXactIdPrecedes(oldestMulti, earliest))
        {
                LWLockRelease(MultiXactTruncationLock);
                return;
        }

        /*
         * First, compute the safe truncation point for MultiXactMember. This is
         * the starting offset of the oldest multixact.
         *
         * Hopefully, find_multixact_start will always work here, because we've
         * already checked that it doesn't precede the earliest MultiXact on 
disk.
         * But if it fails, don't truncate anything, and log a message.
         */
        if (oldestMulti == nextMulti)
        {
                /* there are NO MultiXacts */
                oldestOffset = nextOffset;
        }
        else if (!find_multixact_start(oldestMulti, &oldestOffset))
        {
                ereport(LOG,
                                (errmsg("oldest MultiXact %u not found, earliest 
MultiXact %u, skipping truncation",
                                                oldestMulti, earliest)));
                LWLockRelease(MultiXactTruncationLock);
                return;
        }

1. In the above, we scan the 'offsets' SLRU to find the earliest existing page. Per the comments, that's to avoid reading an offset that doesn't exist. But find_multixact_start() handles that gracefully; we check for that above and print the "oldest Multixact %u not found" log message if it happens. What's the point of scanning the SLRU then, why not just let find_multixact_start() detect the missing segment?

2. The above code looks up oldestOffset, which is then included in the WAL record and passed to PerformMembersTruncation(). But PerformMembersTruncation() doesn't use it for anything. It's not used during WAL replay either. This is new with the 64-bit offsets; on stable branches PerformMembersTruncation() does use it.

I'm inclined to leave this alone in stable branches, as it's not broken, just somewhat pointless. But for 'master', I propose the attached v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch.

For all branches, I propose v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a check for oldestOffset == 0. That fixes the potential for catastrophic truncation with invalid offset 0.

- Heikki
From 085401c0d226cd0bff176b6c70b3be37ae18f656 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Tue, 6 Jan 2026 13:34:35 +0200
Subject: [PATCH v1 1/2] Remove some unnecessary code from multixact truncation

With 64-bit multixact offsets, PerformMembersTruncation() doesn't need
the starting offset anymore. The 'oldestOffset' value that
TruncateMultiXact() calculates is no longer used for anything. Remove
it, and the code to calculate it.

'oldestOffset' was included in the WAL record as 'startTruncMemb',
which sounds nice if you e.g. look at the WAL with pg_waldump, but it
was also confusing because we didn't actually use the value for
determining what to truncate. Replaying the WAL would remove all
segments older than 'endTruncMemb', regardless of
'startTruncMemb'. The 'startTruncOff' stored in the WAL record was
similarly unnecessary even before 64-bit multixid offsets, it was
stored just for the sake of symmetry with 'startTruncMemb'. Remove
both from the WAL record, and rename the remaining 'endTruncOff' to
'oldestMulti' and 'endTruncMemb' to 'oldestOffset', for consistency
with the variable names used to calculate them.

Discussion: https://www.postgresql.org/message-id/[email protected]
---
 src/backend/access/rmgrdesc/mxactdesc.c |   5 +-
 src/backend/access/transam/multixact.c  | 141 +++++-------------------
 src/include/access/multixact.h          |  10 +-
 3 files changed, 32 insertions(+), 124 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 32a031d9e60..6919c7ddd2e 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -74,9 +74,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
-						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+		appendStringInfo(buf, "oldestMulti %u, oldestOffset %" PRIu64,
+						 xlrec->oldestMulti, xlrec->oldestOffset);
 	}
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3f423636b48..39e07d5924f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -282,9 +282,7 @@ static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
 static void SetOldestOffset(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
-								  MultiXactId startTruncOff,
 								  MultiXactId endTruncOff,
-								  MultiXactOffset startTruncMemb,
 								  MultiXactOffset endTruncMemb);
 
 
@@ -2558,45 +2556,22 @@ MultiXactMemberFreezeThreshold(void)
 	return Min(result, autovacuum_multixact_freeze_max_age);
 }
 
-typedef struct mxtruncinfo
-{
-	int64		earliestExistingPage;
-} mxtruncinfo;
 
 /*
- * SlruScanDirectory callback
- *		This callback determines the earliest existing page number.
- */
-static bool
-SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data)
-{
-	mxtruncinfo *trunc = (mxtruncinfo *) data;
-
-	if (trunc->earliestExistingPage == -1 ||
-		ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
-	{
-		trunc->earliestExistingPage = segpage;
-	}
-
-	return false;				/* keep going */
-}
-
-
-/*
- * Delete members segments [oldest, newOldest)
+ * Delete members segments older than newOldestOffset
  */
 static void
-PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
+PerformMembersTruncation(MultiXactOffset newOldestOffset)
 {
 	SimpleLruTruncate(MultiXactMemberCtl,
 					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
- * Delete offsets segments [oldest, newOldest)
+ * Delete offsets segments older than newOldestMulti
  */
 static void
-PerformOffsetsTruncation(MultiXactId oldestMulti, MultiXactId newOldestMulti)
+PerformOffsetsTruncation(MultiXactId newOldestMulti)
 {
 	/*
 	 * We step back one multixact to avoid passing a cutoff page that hasn't
@@ -2626,10 +2601,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	MultiXactId oldestMulti;
 	MultiXactId nextMulti;
 	MultiXactOffset newOldestOffset;
-	MultiXactOffset oldestOffset;
 	MultiXactOffset nextOffset;
-	mxtruncinfo trunc;
-	MultiXactId earliest;
 
 	Assert(!RecoveryInProgress());
 	Assert(MultiXactState->finishedStartup);
@@ -2660,59 +2632,6 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 		return;
 	}
 
-	/*
-	 * Note we can't just plow ahead with the truncation; it's possible that
-	 * there are no segments to truncate, which is a problem because we are
-	 * going to attempt to read the offsets page to determine where to
-	 * truncate the members SLRU.  So we first scan the directory to determine
-	 * the earliest offsets page number that we can read without error.
-	 *
-	 * When nextMXact is less than one segment away from multiWrapLimit,
-	 * SlruScanDirCbFindEarliest can find some early segment other than the
-	 * actual earliest.  (MultiXactOffsetPagePrecedes(EARLIEST, LATEST)
-	 * returns false, because not all pairs of entries have the same answer.)
-	 * That can also arise when an earlier truncation attempt failed unlink()
-	 * or returned early from this function.  The only consequence is
-	 * returning early, which wastes space that we could have liberated.
-	 *
-	 * NB: It's also possible that the page that oldestMulti is on has already
-	 * been truncated away, and we crashed before updating oldestMulti.
-	 */
-	trunc.earliestExistingPage = -1;
-	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
-	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
-	if (earliest < FirstMultiXactId)
-		earliest = FirstMultiXactId;
-
-	/* If there's nothing to remove, we can bail out early. */
-	if (MultiXactIdPrecedes(oldestMulti, earliest))
-	{
-		LWLockRelease(MultiXactTruncationLock);
-		return;
-	}
-
-	/*
-	 * First, compute the safe truncation point for MultiXactMember. This is
-	 * the starting offset of the oldest multixact.
-	 *
-	 * Hopefully, find_multixact_start will always work here, because we've
-	 * already checked that it doesn't precede the earliest MultiXact on disk.
-	 * But if it fails, don't truncate anything, and log a message.
-	 */
-	if (oldestMulti == nextMulti)
-	{
-		/* there are NO MultiXacts */
-		oldestOffset = nextOffset;
-	}
-	else if (!find_multixact_start(oldestMulti, &oldestOffset))
-	{
-		ereport(LOG,
-				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-						oldestMulti, earliest)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
-	}
-
 	/*
 	 * Secondly compute up to where to truncate. Lookup the corresponding
 	 * member offset for newOldestMulti for that.
@@ -2732,13 +2651,11 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	}
 
 	elog(DEBUG1, "performing multixact truncation: "
-		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
-		 oldestMulti, newOldestMulti,
-		 MultiXactIdToOffsetSegment(oldestMulti),
+		 "oldestMulti %u (offsets segment %" PRIx64 "), "
+		 "oldestOffset %" PRIu64 " (members segment %" PRIx64 ")",
+		 newOldestMulti,
 		 MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
-		 MXOffsetToMemberSegment(oldestOffset),
+		 newOldestOffset,
 		 MXOffsetToMemberSegment(newOldestOffset));
 
 	/*
@@ -2760,8 +2677,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
-						  oldestMulti, newOldestMulti,
-						  oldestOffset, newOldestOffset);
+						  newOldestMulti,
+						  newOldestOffset);
 
 	/*
 	 * Update in-memory limits before performing the truncation, while inside
@@ -2778,10 +2695,10 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
-	PerformMembersTruncation(oldestOffset, newOldestOffset);
+	PerformMembersTruncation(newOldestOffset);
 
 	/* Then offsets */
-	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
+	PerformOffsetsTruncation(newOldestMulti);
 
 	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
 
@@ -2860,19 +2777,15 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
  */
 static void
 WriteMTruncateXlogRec(Oid oldestMultiDB,
-					  MultiXactId startTruncOff, MultiXactId endTruncOff,
-					  MultiXactOffset startTruncMemb, MultiXactOffset endTruncMemb)
+					  MultiXactId oldestMulti,
+					  MultiXactOffset oldestOffset)
 {
 	XLogRecPtr	recptr;
 	xl_multixact_truncate xlrec;
 
 	xlrec.oldestMultiDB = oldestMultiDB;
-
-	xlrec.startTruncOff = startTruncOff;
-	xlrec.endTruncOff = endTruncOff;
-
-	xlrec.startTruncMemb = startTruncMemb;
-	xlrec.endTruncMemb = endTruncMemb;
+	xlrec.oldestMulti = oldestMulti;
+	xlrec.oldestOffset = oldestOffset;
 
 	XLogBeginInsert();
 	XLogRegisterData(&xlrec, SizeOfMultiXactTruncate);
@@ -2943,14 +2856,12 @@ multixact_redo(XLogReaderState *record)
 			   SizeOfMultiXactTruncate);
 
 		elog(DEBUG1, "replaying multixact truncation: "
-			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
-			 xlrec.startTruncOff, xlrec.endTruncOff,
-			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
-			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
-			 MXOffsetToMemberSegment(xlrec.startTruncMemb),
-			 MXOffsetToMemberSegment(xlrec.endTruncMemb));
+			 "oldestMulti %u (offsets segment %" PRIx64 "), "
+			 "oldestOffset %" PRIu64 " (members segment %" PRIx64 ")",
+			 xlrec.oldestMulti,
+			 MultiXactIdToOffsetSegment(xlrec.oldestMulti),
+			 xlrec.oldestOffset,
+			 MXOffsetToMemberSegment(xlrec.oldestOffset));
 
 		/* should not be required, but more than cheap enough */
 		LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
@@ -2959,19 +2870,19 @@ multixact_redo(XLogReaderState *record)
 		 * Advance the horizon values, so they're current at the end of
 		 * recovery.
 		 */
-		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
+		SetMultiXactIdLimit(xlrec.oldestMulti, xlrec.oldestMultiDB);
 
-		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
+		PerformMembersTruncation(xlrec.oldestOffset);
 
 		/*
 		 * During XLOG replay, latest_page_number isn't necessarily set up
 		 * yet; insert a suitable value to bypass the sanity test in
 		 * SimpleLruTruncate.
 		 */
-		pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
+		pageno = MultiXactIdToOffsetPage(xlrec.oldestMulti);
 		pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number,
 							pageno);
-		PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
+		PerformOffsetsTruncation(xlrec.oldestMulti);
 
 		LWLockRelease(MultiXactTruncationLock);
 	}
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 26f6bd2890e..58ef7ad59df 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -83,13 +83,11 @@ typedef struct xl_multixact_truncate
 {
 	Oid			oldestMultiDB;
 
-	/* to-be-truncated range of multixact offsets */
-	MultiXactId startTruncOff;	/* just for completeness' sake */
-	MultiXactId endTruncOff;
+	/* truncate multixact offsets older than this  */
+	MultiXactId oldestMulti;
 
-	/* to-be-truncated range of multixact members */
-	MultiXactOffset startTruncMemb;
-	MultiXactOffset endTruncMemb;
+	/* truncate multixact members older than this */
+	MultiXactOffset oldestOffset;
 } xl_multixact_truncate;
 
 #define SizeOfMultiXactTruncate (sizeof(xl_multixact_truncate))
-- 
2.47.3

From 02e8a63bded29edecffa3e94d6ab71174ef41770 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Tue, 6 Jan 2026 13:44:46 +0200
Subject: [PATCH v1 2/2] Add check for invalid offset at multixid truncation

If a multixid with zero offset is left behind after a crash, and that
multixid later becomes the oldest multixid, truncation might try to
look up its offset and read the zero value. In the worst case, we
might incorrectly use the zero offset to truncate valid SLRU segments
that are still needed. I'm not sure if that can happen in practice, or
if there are some other lower-level safeguards or incidental reasons
that prevent the caller from passing an unwritten multixid as the
oldest multi. But in any case, let's add an explicit check for it.

Discussion: https://www.postgresql.org/message-id/[email protected]
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 39e07d5924f..5150ab1ecbd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2650,6 +2650,23 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 		return;
 	}
 
+	/*
+	 * On crash, MultiXactIdCreateFromMembers() can leave behind multixids
+	 * that were not yet written out and hence have zero offset on disk. If
+	 * such a multixid becomes oldestMulti, we won't be able to look up its
+	 * offset. That should be rare, so we don't try to do anything smart about
+	 * it. Just skip the truncation, and hope that by the next truncation
+	 * attempt, oldestMulti has advanced to a valid multixid.
+	 */
+	if (newOldestOffset == 0)
+	{
+		ereport(LOG,
+				(errmsg("cannot truncate up to MultiXact %u because it has invalid offset, skipping truncation",
+						newOldestMulti)));
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
+
 	elog(DEBUG1, "performing multixact truncation: "
 		 "oldestMulti %u (offsets segment %" PRIx64 "), "
 		 "oldestOffset %" PRIu64 " (members segment %" PRIx64 ")",
-- 
2.47.3

Reply via email to