On Thu, Mar 5, 2026 at 7:00 AM Andrey Borodin <[email protected]> wrote:
> Thanks for the review! I've addressed all these issues in patch steps 4 and
> 5. Other steps are intact. PFA.
Hi Andrey and Wenboo,
1. I tried the v4 to see how it would cope with heap missing segments[0]
(with -Dsegsize_blocks=6), so given:
create table ptest(id bigint, tenant_id bigint);
insert into ptest select g, mod(g,10) from generate_series(1, 3000) g;
alter table ptest add primary key (id);
checkpoint;
postgres=# select count(*) from ptest;
count
-------
3000
-- simulate loss of middle segment:
$ mv base/5/24578.1 base/5/24578.1.ORG
$ ls -1 base/5/24578*
base/5/24578
base/5/24578.1.ORG
base/5/24578.2
base/5/24578_fsm
base/5/24578_vm
-- force _mfd_openseg() via new backend to show silent data loss:
postgres=# select count(*) from ptest;
count
-------
1110
-- try new amcheck
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>false);
bt_index_check
----------------
-- try the new indexallkeysmatch
-- might need restart before it is able to properly bailout
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>true);
ERROR: could not open file "base/5/24578.1" (target block 6): No such
file or directory
this error was not quite expected for me (altough it is much better than nothing
as on master today), because patch v4-0002 assumes in
bt_verify_index_tuple_points_to_heap() that we should get false and error in
more human-firendly way ("index tuple points to non-existent heap"). The error
itself is coming out of the following stacktrace:
_mdfd_getseg (forknum=MAIN_FORKNUM, blkno=6)
<- mdstartreadv (..) <- smgrstartreadv (blocknum=blocknum@entry=6 (!))
<- StartReadBuffersImpl (blockNum=6 (!)) <- StartReadBuffer()
<- ReadBuffer_common() <- ReadBufferExtended (blockNum=6 (!))
<- heap_fetch() <- heapam_fetch_row_version()
<- table_tuple_fetch_row_version()
<- bt_verify_index_tuple_points_to_heap(targetblock=5, offset=14)
<- bt_target_page_check() <- bt_check_level_from_leftmost()
<- bt_check_every_level() <- bt_index_check_callback()
<- amcheck_lock_relation_and_check() <- bt_index_check()
and it's because itup in bt_target_page_check() has bi_lo = 6, ip_posid = 1
So with the attached patch it becomes more human understandable, but it is
somehow orthogonal check? Still maybe we can combine this under this $thread
ambrella and make it as an option of indexallkeysmatch too? Attached
behaves likes this:
postgres=# select bt_index_check(index=>'ptest_pkey');
ERROR: index line pointer in index "ptest_pkey" points to missing
page in table "ptest"
DETAIL: Index tid=(5,14) points to heap tid=(6,1) but heap has only 6 blocks.
HINT: this can be caused by lost relation segment (missing or removed file).
(note it does not need to indexallkeysmatch, but it's related)
Also, the attached patch right now blows up pg_amcheck test (not
contrib/amcheck), where we ask to ignore certain corrupted schemas/tables to
test exclusion logic, but we still verify btrees and with attached this btree
verification ends up discovering short heap segment :D I honestly do not know
what to do in this situation (we ask to ignore one table T, but still launch
bt_index_check() on it's index and discover illegal sitation about state of T).
Also I'm not sure if the rechecking of heapnblocks is good way. It's just some
idea.
2. For some reason email from Wenbo didn't get to my emailbox, but I can
see his email here:
https://www.postgresql.org/message-id/resend/ba92ac77-24f1-44ad-abf0-11e64e0a7831%40gmail.com
(and resending is is somewhat not effective)
> I've been reviewing the patch and ran some concurrent tests against it.
> I found an issue with false positive corruption reports under concurrent
> VACUUM.
>
> + slot = table_slot_create(state->heaprel, NULL);
> + found = table_tuple_fetch_row_version(state->heaprel, tid,
> + SnapshotAny, slot);
> + if (!found)
> + {
> + ExecDropSingleTupleTableSlot(slot);
> + ereport(ERROR,
>
>
> bt_verify_index_tuple_points_to_heap uses SnapshotAny with
> table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
> "tuple exists but is dead". However, bt_index_check only holds
> AccessShareLock which compatible with VACUUM's ShareUpdateExclusiveLock.
> A concurrent VACUUM Phase 1 can prune heap pages(ItemIdSetDead) between
> Bloom filter probe and the heap lookup. Causing
> table_tuple_fetch_row_version to return false for a legal
> dead-but-now-pruned tuple and a false positive corruption report.
>
> The table_tuple_satisfies_snapshot check does not help, it only runs
> when the fetch succeeds(LP_NORMAL). Once VACUUM sets LP_DEAD,
> heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
> snapshot checks.
>
> The reproduce should be: DELETE all rows form a big table, then launch
> VACUUM and bt_index_check concurrently. A POC test script attached.
I think this is spot-on. I haven't run repro, but if we take ASL and VACUUM
is allowed to run (under ShareUpdateExclusiveLock) then amcheck performs the
heap bloom building (but ignores dead tuples, while btree still can
point to those),
but VACUUM could later remove the heap tuple. It seems to be impossible (under
ASL) to differentiate orphaned index tuple from concurrently vacuumed
dead tuple?
Maybe we should grab ShareUpdateExclusiveLock in bt_index_check for
indexallkeysmatch==true? With some new documentation warning about
this? (to block
just VACUUM, but not DML)
Alternative is to do it via bt_index_parent_check()/ShareLock, but it's kind of
hardly acceptable in many sitations.
3. This is more a question than finding: assuming extremly big tables, wouldn't
we benefit from some form of caching for EState in
bt_verify_index_tuple_points_to_heap()? Or we do not care about such stuff
in amcheck? (we seem to re-create Estate with each call to FormIndexDatum())
-J.
[0] -
https://www.postgresql.org/message-id/flat/013D63E2-5D75-492E-85FF-1D5CC0148C82%40gmail.com
From f6d6f61e992820a16708b50d8ecba91567f55e0c Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Wed, 8 Apr 2026 14:01:59 +0200
Subject: [PATCH v4b] amcheck: attempt to detect lost heap segments based on
index tuples
bt_index_check() family of functions reads the whole btree index
and builds bloom filter along the way. Later it reads the heap table
and for every tuple there it probes using bloom filter if the index
has entry or not. However in conditions where some segments of the
main fork have been removed as discussed in [0] (e.g. by rogue action,
filesystem corruption/fsck, missing full backup restore) this logic
- after startup - cannot reliably detect that relation is 'short' as the
RelationGetNumberOfBlocks() will stop counting on the last available
heap segment, therefore not full heap segment will be probed against
btree bloom filter.
This patch teaches amcheck to verify during btree scan if the currently
processed index tuple does not point to heap block above current relation
size.
[0] - https://www.postgresql.org/message-id/flat/013D63E2-5D75-492E-85FF-1D5CC0148C82%40gmail.com
Author: Jakub Wartak <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/[email protected]
---
contrib/amcheck/meson.build | 1 +
.../t/008_verify_nbtree_lost_segments.pl | 115 ++++++++++++++++++
contrib/amcheck/verify_nbtree.c | 51 ++++++++
3 files changed, 167 insertions(+)
create mode 100644 contrib/amcheck/t/008_verify_nbtree_lost_segments.pl
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index 220b1ce1d59..28e3e17b447 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -52,6 +52,7 @@ tests += {
't/005_pitr.pl',
't/006_verify_gin.pl',
't/007_verify_nbtree_indexallkeysmatch.pl',
+ 't/008_verify_nbtree_lost_segments.pl',
],
},
}
diff --git a/contrib/amcheck/t/008_verify_nbtree_lost_segments.pl b/contrib/amcheck/t/008_verify_nbtree_lost_segments.pl
new file mode 100644
index 00000000000..7a02f106693
--- /dev/null
+++ b/contrib/amcheck/t/008_verify_nbtree_lost_segments.pl
@@ -0,0 +1,115 @@
+
+# Copyright (c) 2023-2026, PostgreSQL Global Development Group
+
+# This regression test checks the behavior of the btree validation in the
+# presence of missing relation segments.
+#
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Create two tables, one with unique index and another to test
+# posting list (btree duplicates).
+$node->safe_psql(
+ 'postgres', q(
+ CREATE EXTENSION amcheck;
+ CREATE TABLE missingsegs_test1 AS
+ SELECT * FROM generate_series(1, 3000) id;
+ CREATE TABLE missingsegs_test2 AS
+ SELECT 10 AS id FROM generate_series(1, 3000);
+ CREATE UNIQUE INDEX bttest_unique_idx1 ON missingsegs_test1 (id);
+ CREATE INDEX bttest_idx2 ON missingsegs_test2 (id);
+));
+
+my ($result, $stdout, $stderr);
+
+# We have not yet broken the index, so we should get no corruption
+$result = $node->safe_psql(
+ 'postgres', q(
+ SELECT bt_index_check('bttest_unique_idx1', true, true, true);
+));
+is($result, '', 'run amcheck on non-broken bttest_unique_idx1');
+
+$result = $node->safe_psql(
+ 'postgres', q(
+ SELECT bt_index_check('bttest_idx2', true, true, true);
+));
+is($result, '', 'run amcheck on non-broken bttest_idx2');
+
+# Break the relations, simulating rogue action or just fsck moving files
+# into the /lost+found.
+my $relpath1 = relation_filepath('missingsegs_test1');
+my $relpath2 = relation_filepath('missingsegs_test2');
+$node->stop;
+corrupt_segment($relpath1.".1");
+corrupt_segment($relpath2.".1");
+$node->start;
+
+$result = $node->safe_psql(
+ 'postgres', q(
+ SET enable_indexscan TO off;
+ SET enable_indexonlyscan TO off;
+ SELECT count(id) FROM missingsegs_test1;
+));
+cmp_ok(
+ '3000', '>', $result,
+ "ensure there is missing data on missingsegs_test1");
+
+$result = $node->safe_psql(
+ 'postgres', q(
+ SET enable_indexscan TO off;
+ SET enable_indexonlyscan TO off;
+ SELECT count(id) FROM missingsegs_test2;
+));
+cmp_ok(
+ '3000', '>', $result,
+ "ensure there is missing data on missingsegs_test2");
+
+($result, $stdout, $stderr) = $node->psql(
+ 'postgres', q(SELECT bt_index_check('bttest_unique_idx1', true, true, true);)
+);
+like(
+ $stderr,
+ qr/index line pointer in index "bttest_unique_idx1" points to missing page in table "missingsegs_test1"/,
+ 'detected corrupted segments for missingsegs_test1');
+
+($result, $stdout, $stderr) = $node->psql(
+ 'postgres', q(SELECT bt_index_check('bttest_idx2', true, true, true);)
+);
+like(
+ $stderr,
+ qr/index line pointer in index "bttest_idx2" points to missing page in table "missingsegs_test2"/,
+ 'detected corrupted segments for missingsegs_test2');
+
+$node->stop;
+done_testing();
+
+# Returns the filesystem path for the named relation.
+sub relation_filepath
+{
+ my ($relname) = @_;
+
+ my $pgdata = $node->data_dir;
+ my $rel = $node->safe_psql('postgres',
+ qq(SELECT pg_relation_filepath('$relname')));
+ die "path not found for relation $relname" unless defined $rel;
+ return "$pgdata/$rel";
+}
+
+# Rename segment so that it is in accessible
+sub corrupt_segment
+{
+ my ($relpath) = @_;
+ my $destrelpath = $relpath . ".BAK";
+
+ rename($relpath, $destrelpath)
+ or BAIL_OUT("rename failed: $!");
+}
+
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7243b83977d..41176e98c59 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -149,6 +149,8 @@ typedef struct BtreeCheckState
bloom_filter *heapfilter;
/* Debug counter for index tuples verified */
int64 indextuplesverified;
+ /* Short heap segments verification */
+ BlockNumber heapnblocks;
} BtreeCheckState;
/*
@@ -441,6 +443,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->rel = rel;
state->heaprel = heaprel;
state->heapkeyspace = heapkeyspace;
+ state->heapnblocks = RelationGetNumberOfBlocks(state->heaprel);
state->readonly = readonly;
state->heapallindexed = heapallindexed;
state->indexallkeysmatch = indexallkeysmatch;
@@ -1608,6 +1611,54 @@ bt_target_page_check(BtreeCheckState *state)
}
}
+ /* Check if leaf page tuples point to valid heap tuples */
+ if(P_ISLEAF(topaque) && !ItemIdIsDead(itemid)) {
+ int nposting = 1;
+
+ if(BTreeTupleIsPosting(itup))
+ nposting = BTreeTupleGetNPosting(itup);
+
+ for (int i = 0; i < nposting; i++)
+ {
+ ItemPointer htid;
+ BlockNumber heapblk;
+ OffsetNumber heapoff;
+
+ if (nposting > 1)
+ htid = BTreeTupleGetPostingN(itup, i);
+ else
+ htid = BTreeTupleGetPointsToTID(itup);
+
+ heapblk = ItemPointerGetBlockNumber(htid);
+ heapoff = ItemPointerGetOffsetNumber(htid);
+
+ /*
+ * Does heapblk goes beyond RelationGetNumberOfBlocks() - potentially
+ * indicating missing relation segment?
+ */
+ if (heapblk >= state->heapnblocks) {
+ /* We may need to recheck our cached value as we operate with ASL */
+ state->heapnblocks = RelationGetNumberOfBlocks(state->heaprel);
+ if(heapblk >= state->heapnblocks) {
+ char *postingoff = "";
+ if(nposting > 1)
+ postingoff = psprintf("posting list offset=%d", i);
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index line pointer in index \"%s\" points to missing page in table \"%s\"",
+ RelationGetRelationName(state->rel),
+ RelationGetRelationName(state->heaprel)),
+ errdetail_internal("Index tid=(%u,%u) %s points to heap tid=(%u,%u) but heap has only %u blocks.",
+ state->targetblock, offset, postingoff,
+ heapblk, heapoff,
+ state->heapnblocks)),
+ errhint("this can be caused by lost relation segment (missing or removed file)."));
+ }
+ }
+ }
+ }
+
/* Verify each index tuple points to heap tuple with same key */
if (state->indexallkeysmatch && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
{
--
2.43.0