Re: [HACKERS] A design for amcheck heapam verification

Peter Geoghegan Wed, 06 Sep 2017 19:28:01 -0700

On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
> <alvhe...@2ndquadrant.com> wrote:
>> Eh, if you want to optimize it for the case where debug output is not
>> enabled, make sure to use ereport() not elog().  ereport()
>> short-circuits evaluation of arguments, whereas elog() does not.
>
> I should do that, but it's still not really noticeable.


Since this patch has now bit-rotted, I attach a new revision, V2.
Apart from fixing some Makefile bitrot, this revision also makes some
small tweaks as suggested by Thomas and Alvaro. The documentation is
also revised and expanded, and now discusses practical aspects of the
set membership being tested using a Bloom filter, how that relates to
maintenance_work_mem, and so on.

Note that this revision does not let the Bloom filter caller use their
own dynamic shared memory, which is something that Thomas asked about.
While that could easily be added, I think it should happen later. I
really just wanted to make sure that my Bloom filter was not in some
way fundamentally incompatible with Thomas' planned enhancements to
(parallel) hash join.

-- 
Peter Geoghegan

From d4dda95dd41204315dc12936fac83d2df8676992 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.

A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership.  Callers will sometimes incur false
positives, but never false negatives.  The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering, and data
synchronization testing.  Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.
---
 src/backend/lib/Makefile      |   4 +-
 src/backend/lib/README        |   2 +
 src/backend/lib/bloomfilter.c | 297 ++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/bloomfilter.h |  26 ++++
 4 files changed, 327 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/bloomfilter.c
 create mode 100644 src/include/lib/bloomfilter.h

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe4..191ea9b 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
-	   knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e..376ae27 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
 
 binaryheap.c - a binary heap
 
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
 hyperloglog.c - a streaming cardinality estimator
 
 pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 0000000..e93f9b0
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,297 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.c
+ *		Minimal Bloom filter
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set.  False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set".  This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed.  The more elements that
+ * are added, the larger the probability of false positives.  Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/bloomfilter.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/hash.h"
+#include "lib/bloomfilter.h"
+#include "utils/memutils.h"
+
+#define MAX_HASH_FUNCS	10
+#define NBITS(filt)		((1 << (filt)->bloom_power))
+
+typedef struct bloom_filter
+{
+	/* 2 ^ bloom_power is the size of the bitset (in bits) */
+	int				bloom_power;
+	unsigned char  *bitset;
+
+	/* K hash functions are used, which are randomly seeded */
+	int				k_hash_funcs;
+	uint32			seed;
+} bloom_filter;
+
+static int pow2_truncate(int64 target_bitset_size);
+static int optimal_k(int64 bits, int64 total_elems);
+static void k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem,
+					 size_t len);
+static uint32 sdbmhash(unsigned char *elem, size_t len);
+
+/*
+ * Initialize Bloom filter.  This should get a false positive rate of between
+ * 1% and 2% when its bitset is not constrained by memory.
+ *
+ * total_elems is an estimate of the final size of the set.  It ought to be
+ * approximately correct, but we can cope well with it being off by perhaps a
+ * factor of five or more.  See "Bloom Filters in Probabilistic Verification"
+ * (Dillinger & Manolios, 2004) for details of why this is the case.
+ *
+ * work_mem is sized in KB, in line with the general convention.
+ *
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value.  This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
+ *
+ * Notes on appropriate use:
+ *
+ * To keep the implementation simple and predictable, the underlying bitset is
+ * always sized as a power-of-two number of bits, and the largest possible
+ * bitset is 2 ^ 30 bits, or 128MB.  The implementation is therefore well
+ * suited to data synchronization problems between unordered sets, where
+ * predictable performance is more important than worst case guarantees around
+ * false positives.  Another problem that the implementation is well suited for
+ * is cache filtering where good performance already relies upon having a
+ * relatively small and/or low cardinality set of things that are interesting
+ * (with perhaps many more uninteresting things that never populate the
+ * filter).
+ */
+bloom_filter *
+bloom_init(int64 total_elems, int work_mem, uint32 seed)
+{
+	bloom_filter   *filter;
+	int64			bitset_bytes;
+	int64			bitset_bits;
+
+	filter = palloc(sizeof(bloom_filter));
+
+	/*
+	 * Aim for two bytes per element; this is sufficient to get a false
+	 * positive rate below 1%, independent of the size of the bitset or total
+	 * number of elements.  Also, if rounding down the size of the bitset to
+	 * the next lowest power of two turns out to be a significant drop, the
+	 * false positive rate still won't exceed 2% in almost all cases.
+	 */
+	bitset_bytes = Min(total_elems * 2, MaxAllocSize);
+	bitset_bytes = Min(work_mem * 1024L, bitset_bytes);
+	/* Minimum allowable size is 1MB */
+	bitset_bytes = Max(1024L * 1024L, bitset_bytes);
+
+	/* Size in bits should be the highest power of two within budget */
+	filter->bloom_power = pow2_truncate(bitset_bytes * BITS_PER_BYTE);
+	bitset_bits = NBITS(filter);
+	bitset_bytes = bitset_bits / BITS_PER_BYTE;
+	filter->bitset = palloc0(bitset_bytes);
+	filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+	filter->seed = seed;
+
+	return filter;
+}
+
+/*
+ * Free Bloom filter
+ */
+void
+bloom_free(bloom_filter *filter)
+{
+	pfree(filter->bitset);
+	pfree(filter);
+}
+
+/*
+ * Add element to Bloom filter
+ */
+void
+bloom_add_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+	uint32	hashes[MAX_HASH_FUNCS];
+	int		i;
+
+	k_hashes(filter, hashes, elem, len);
+
+	/* Map a bit-wise address to a byte-wise address + bit offset */
+	for (i = 0; i < filter->k_hash_funcs; i++)
+	{
+		filter->bitset[hashes[i] >> 3] |= 1 << (hashes[i] & 7);
+	}
+}
+
+/*
+ * Test if Bloom filter definitely lacks element.
+ *
+ * Returns true if the element is definitely not in the set of elements
+ * observed by bloom_add_element().  Otherwise, returns false, indicating that
+ * element is probably present in set.
+ */
+bool
+bloom_lacks_element(bloom_filter *filter, unsigned char *elem, size_t len)
+{
+	uint32	hashes[MAX_HASH_FUNCS];
+	int		i;
+
+	k_hashes(filter, hashes, elem, len);
+
+	/* Map a bit-wise address to a byte-wise address + bit offset */
+	for (i = 0; i < filter->k_hash_funcs; i++)
+	{
+		if (!(filter->bitset[hashes[i] >> 3] & (1 << (hashes[i] & 7))))
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
+ * This is a useful, generic indicator of whether or not a Bloom filter has
+ * summarized the set optimally within the available memory budget.  If return
+ * value exceeds 0.5 significantly, then that's either because there was a
+ * dramatic underestimation of set size by the caller, or because available
+ * work_mem is very low relative to the size of the set (less than 2 bits per
+ * element).
+ *
+ * Note that the value returned here should generally be close to 0.5, even
+ * when we have more than enough memory to ensure a false positive rate within
+ * our target 1% - 2% band, since more hash functions are used as more memory
+ * is available per element.
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
+{
+	int		bitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+	int64	bits_set = 0;
+	int		i;
+
+	for (i = 0; i < bitset_bytes; i++)
+	{
+		unsigned char byte = filter->bitset[i];
+
+		while (byte)
+		{
+			bits_set++;
+			byte &= (byte - 1);
+		}
+	}
+
+	return bits_set / (double) NBITS(filter);
+}
+
+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB).  This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+	int v = 0;
+
+	while (target_bitset_size > 0)
+	{
+		v++;
+		target_bitset_size = target_bitset_size >> 1;
+	}
+
+	return Min(v - 1, 30);
+}
+
+/*
+ * Determine optimal number of hash functions based on size of filter in bits,
+ * and projected total number of elements.  The optimal number is the number
+ * that minimizes the false positive rate.
+ */
+static int
+optimal_k(int64 bits, int64 total_elems)
+{
+	int		k = round(log(2.0) * bits / total_elems);
+
+	return Max(1, Min(k, MAX_HASH_FUNCS));
+}
+
+/*
+ * Generate k hash values for element.
+ *
+ * Caller passes array, which is filled-in with k values determined by hashing
+ * caller's element.
+ *
+ * Only 2 real independent hash functions are actually used to support an
+ * interface of up to MAX_HASH_FUNCS hash functions; "enhanced double hashing"
+ * is used to make this work.  See Dillinger & Manolios for details of why
+ * that's okay.  "Building a Better Bloom Filter" by Kirsch & Mitzenmacher also
+ * has detailed analysis of the algorithm.
+ */
+static void
+k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
+{
+	uint32	hasha,
+			hashb;
+	int		i;
+
+	hasha = DatumGetUInt32(hash_any(elem, len));
+	hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
+
+	/* Mix seed value */
+	hasha += filter->seed;
+	/* Apply "MOD m" to avoid losing bits/out-of-bounds array access */
+	hasha = hasha % NBITS(filter);
+	hashb = hashb % NBITS(filter);
+
+	/* First hash */
+	hashes[0] = hasha;
+
+	/* Subsequent hashes */
+	for (i = 1; i < filter->k_hash_funcs; i++)
+	{
+		hasha = (hasha + hashb) % NBITS(filter);
+		hashb = (hashb + i) % NBITS(filter);
+
+		/* Accumulate hash value for caller */
+		hashes[i] = hasha;
+	}
+}
+
+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, size_t len)
+{
+	uint32	hash = 0;
+	int		i;
+
+	for (i = 0; i < len; elem++, i++)
+	{
+		hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+	}
+
+	return hash;
+}
diff --git a/src/include/lib/bloomfilter.h b/src/include/lib/bloomfilter.h
new file mode 100644
index 0000000..09a5501
--- /dev/null
+++ b/src/include/lib/bloomfilter.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * bloomfilter.h
+ *	  Minimal Bloom filter
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    src/include/lib/bloomfilter.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BLOOMFILTER_H_
+#define _BLOOMFILTER_H_
+
+typedef struct bloom_filter bloom_filter;
+
+extern bloom_filter *bloom_init(int64 total_elems, int work_mem, uint32 seed);
+extern void bloom_free(bloom_filter *filter);
+extern void bloom_add_element(bloom_filter *filter, unsigned char *elem,
+							  size_t len);
+extern bool bloom_lacks_element(bloom_filter *filter, unsigned char *elem,
+								size_t len);
+extern double bloom_prop_bits_set(bloom_filter *filter);
+
+#endif
-- 
2.7.4

From c68bcd52a3268e0bf09c04f0b794abbcf8474d32 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and
bt_index_parent_check():  callers can check that each heap tuple that
ought to have an index entry does in fact have one.  This happens at the
end of the existing verification checks.

This is implemented by using a Bloom filter data structure.  The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX).  The
Bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql    |  28 ++++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql      |   9 +-
 contrib/amcheck/verify_nbtree.c          | 237 ++++++++++++++++++++++++++++---
 doc/src/sgml/amcheck.sgml                | 149 +++++++++++++++----
 7 files changed, 385 insertions(+), 56 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 0000000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.0'
+default_version = '1.1'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index df3741e..42872b8 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -16,8 +16,8 @@ RESET ROLE;
 -- we, intentionally, don't check relation permissions - it's useful
 -- to run this cluster-wide with a restricted account, and as tested
 -- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
 SET ROLE bttest_role;
 SELECT bt_index_check('bttest_a_idx');
  bt_index_check 
@@ -56,8 +56,14 @@ SELECT bt_index_check('bttest_a_idx');
  
 (1 row)
 
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+ bt_index_check 
+----------------
+ 
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx', true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index fd90531..5d27969 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -19,8 +19,8 @@ RESET ROLE;
 -- we, intentionally, don't check relation permissions - it's useful
 -- to run this cluster-wide with a restricted account, and as tested
 -- above explicit permission has to be granted for that.
-GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
-GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass, boolean) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parent_check(regclass, boolean) TO bttest_role;
 SET ROLE bttest_role;
 SELECT bt_index_check('bttest_a_idx');
 SELECT bt_index_parent_check('bttest_a_idx');
@@ -42,8 +42,9 @@ ROLLBACK;
 
 -- normal check outside of xact
 SELECT bt_index_check('bttest_a_idx');
--- more expansive test
-SELECT bt_index_parent_check('bttest_b_idx');
+-- more expansive tests
+SELECT bt_index_check('bttest_a_idx', true);
+SELECT bt_index_parent_check('bttest_b_idx', true);
 
 BEGIN;
 SELECT bt_index_check('bttest_a_idx');
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9ae83dc..346d788 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -8,6 +8,11 @@
  * (the insertion scankey sort-wise NULL semantics are needed for
  * verification).
  *
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure.  A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *
  *
  * Copyright (c) 2017, PostgreSQL Global Development Group
  *
@@ -18,13 +23,16 @@
  */
 #include "postgres.h"
 
+#include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "access/transam.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "commands/tablecmds.h"
+#include "lib/bloomfilter.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
@@ -53,10 +61,15 @@ typedef struct BtreeCheckState
 	 * Unchanging state, established at start of verification:
 	 */
 
-	/* B-Tree Index Relation */
+	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
+	Relation	heaprel;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
+	/* verifying heap has no unindexed tuples? */
+	bool		heapallindexed;
+	/* Oldest xmin before index examined (for !readonly + heapallindexed calls) */
+	TransactionId	oldestxmin;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -72,6 +85,15 @@ typedef struct BtreeCheckState
 	BlockNumber targetblock;
 	/* Target page's LSN */
 	XLogRecPtr	targetlsn;
+
+	/*
+	 * Mutable state, for optional heapallindexed verification:
+	 */
+
+	/* Bloom filter fingerprints B-Tree index */
+	bloom_filter *filter;
+	/* Debug counter */
+	int64		heaptuplespresent;
 } BtreeCheckState;
 
 /*
@@ -92,15 +114,20 @@ typedef struct BtreeLevel
 PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
-static void bt_index_check_internal(Oid indrelid, bool parentcheck);
+static void bt_index_check_internal(Oid indrelid, bool parentcheck,
+									bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool readonly);
+static void bt_check_every_level(Relation rel, Relation heaprel,
+								 bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
 static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 				  ScanKey targetkey);
+static void bt_tuple_present_callback(Relation index, HeapTuple htup,
+									  Datum *values, bool *isnull,
+									  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
@@ -116,37 +143,47 @@ static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
 /*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)
  *
  * Verify integrity of B-Tree index.
  *
  * Acquires AccessShareLock on heap & index relations.  Does not consider
- * invariants that exist between parent/child pages.
+ * invariants that exist between parent/child pages.  Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
  */
 Datum
 bt_index_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
+	bool		heapallindexed = false;
 
-	bt_index_check_internal(indrelid, false);
+	if (PG_NARGS() == 2)
+		heapallindexed = PG_GETARG_BOOL(1);
+
+	bt_index_check_internal(indrelid, false, heapallindexed);
 
 	PG_RETURN_VOID();
 }
 
 /*
- * bt_index_parent_check(index regclass)
+ * bt_index_parent_check(index regclass, heapallindexed boolean)
  *
  * Verify integrity of B-Tree index.
  *
  * Acquires ShareLock on heap & index relations.  Verifies that downlinks in
- * parent pages are valid lower bounds on child pages.
+ * parent pages are valid lower bounds on child pages.  Optionally verifies
+ * that heap does not contain any unindexed or incorrectly indexed tuples.
  */
 Datum
 bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
+	bool		heapallindexed = false;
 
-	bt_index_check_internal(indrelid, true);
+	if (PG_NARGS() == 2)
+		heapallindexed = PG_GETARG_BOOL(1);
+
+	bt_index_check_internal(indrelid, true, heapallindexed);
 
 	PG_RETURN_VOID();
 }
@@ -155,7 +192,7 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -205,7 +242,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck)
 	btree_index_checkable(indrel);
 
 	/* Check index */
-	bt_check_every_level(indrel, parentcheck);
+	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -253,11 +290,14 @@ btree_index_checkable(Relation rel)
 
 /*
  * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
- * logical order, verifying invariants as it goes.
+ * logical order, verifying invariants as it goes.  Optionally, verification
+ * checks if the heap relation contains any tuples that are not represented in
+ * the index but should be.
  *
  * It is the caller's responsibility to acquire appropriate heavyweight lock on
  * the index relation, and advise us if extra checks are safe when a ShareLock
- * is held.
+ * is held.  (A lock of the same type must also have been acquired on the heap
+ * relation.)
  *
  * A ShareLock is generally assumed to prevent any kind of physical
  * modification to the index structure, including modifications that VACUUM may
@@ -272,7 +312,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, bool readonly)
+bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
+					 bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -291,7 +332,34 @@ bt_check_every_level(Relation rel, bool readonly)
 	 */
 	state = palloc(sizeof(BtreeCheckState));
 	state->rel = rel;
+	state->heaprel = heaprel;
 	state->readonly = readonly;
+	state->heapallindexed = heapallindexed;
+	state->oldestxmin = InvalidTransactionId;
+
+	if (state->heapallindexed)
+	{
+		int64	total_elems;
+		uint32	seed;
+
+		/*
+		 * When only AccessShareLock held on heap, get oldestxmin before index
+		 * is first accessed.  Used for later visibility rechecks, within
+		 * bt_tuple_present_callback().
+		 */
+		if (!state->readonly)
+			state->oldestxmin = GetOldestXmin(state->heaprel,
+											  PROCARRAY_FLAGS_VACUUM);
+
+		/* Size Bloom filter based on estimated number of tuples in index */
+		total_elems = (int64) state->rel->rd_rel->reltuples;
+		/* Random seed relies on backend srandom() call to avoid repetition */
+		seed = random();
+		/* Create Bloom filter to fingerprint index */
+		state->filter = bloom_init(total_elems, maintenance_work_mem, seed);
+		state->heaptuplespresent = 0;
+	}
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -347,6 +415,41 @@ bt_check_every_level(Relation rel, bool readonly)
 		previouslevel = current.level;
 	}
 
+	/*
+	 * * Heap contains unindexed/malformed tuples check *
+	 */
+	if (state->heapallindexed)
+	{
+		IndexInfo  *indexinfo;
+
+		elog(DEBUG1, "verifying presence of required tuples in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+		indexinfo = BuildIndexInfo(state->rel);
+
+		/*
+		 * Since we're not actually indexing, don't enforce uniqueness/wait for
+		 * concurrent insertion to finish, even with unique indexes.
+		 *
+		 * Force use of MVCC snapshot (reuse CONCURRENTLY infrastructure) when
+		 * only an AccessShareLock held.  It seems like a good idea to not
+		 * diverge from expected heap lock strength in all cases.  This is
+		 * needed to prevent unhelpful WARNINGs due to concurrent insertions
+		 * that IndexBuildHeapScan() does not expect.
+		 */
+		indexinfo->ii_Unique = false;
+		indexinfo->ii_Concurrent = !state->readonly;
+		IndexBuildHeapScan(state->heaprel, state->rel, indexinfo, true,
+						   bt_tuple_present_callback, (void *) state);
+
+		ereport(DEBUG1,
+				(errmsg_internal("finished verifying presence of " INT64_FORMAT " tuples (proportion of bits set: %f) from table \"%s\"",
+								 state->heaptuplespresent, bloom_prop_bits_set(state->filter),
+								 RelationGetRelationName(heaprel))));
+
+		bloom_free(state->filter);
+	}
+
 	/* Be tidy: */
 	MemoryContextDelete(state->targetcontext);
 }
@@ -499,7 +602,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 					 errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
 										current, level.level, opaque->btpo.level)));
 
-		/* Verify invariants for page -- all important checks occur here */
+		/* Verify invariants for page */
 		bt_target_page_check(state);
 
 nextpage:
@@ -546,6 +649,9 @@ nextpage:
  *
  * - That all child pages respect downlinks lower bound.
  *
+ * This is also where heapallindexed callers build their Bloom filter for later
+ * verification that index had all heap tuples.
+ *
  * Note:  Memory allocated in this routine is expected to be released by caller
  * resetting state->targetcontext.
  */
@@ -589,6 +695,11 @@ bt_target_page_check(BtreeCheckState *state)
 		itup = (IndexTuple) PageGetItem(state->target, itemid);
 		skey = _bt_mkscankey(state->rel, itup);
 
+		/* When verifying heap, record leaf items in Bloom filter */
+		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
+			bloom_add_element(state->filter, (unsigned char *) itup,
+							  IndexTupleSize(itup));
+
 		/*
 		 * * High key check *
 		 *
@@ -682,8 +793,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Last item check *
 		 *
 		 * Check last item against next/right page's first data item's when
-		 * last item on page is reached.  This additional check can detect
-		 * transposed pages.
+		 * last item on page is reached.  This additional check will detect
+		 * transposed pages iff the supposed right sibling page happens to
+		 * belong before target in the key space.  (Otherwise, a subsequent
+		 * heap verification will probably detect the problem.)
 		 *
 		 * This check is similar to the item order check that will have
 		 * already been performed for every other "real" item on target page
@@ -1062,6 +1175,96 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 }
 
 /*
+ * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
+ * all needed entries using Bloom filter probes.
+ *
+ * The redundancy between an index and the table it indexes provides a good
+ * opportunity to detect corruption in index, and especially in heap.  The high
+ * level principle behind verification performed here is that any index tuples
+ * that should be in the index following a REINDEX should also have been there
+ * all along.  This must be true because a REINDEX rebuilds the index in order
+ * to effectively remove bloat.  There might be dead index tuple entries in the
+ * Bloom filter, because of the lack of reliable visibility information in
+ * index structures, but that hardly matters since we're concerned about the
+ * possible absence of needed tuples.  In other words, a fresh REINDEX should
+ * never affect the representation of any IndexTuple, because these are
+ * immutable for as long as heap tuple is visible to any possible snapshot
+ * (while the LP_DEAD bit is mutable, that's ItemId metadata, which we don't
+ * directly fingerprint).
+ *
+ * Since the overall structure of the index has already been verified, the most
+ * likely explanation for invariant not holding is a corrupt heap page (could
+ * be logical or physical corruption), which is why heap is blamed here.  Heap
+ * corruption is not always the problem, though.  Only readonly callers will
+ * have verified that left links and right links are in agreement, and so it's
+ * possible that a leaf page transposition within index is actually the source
+ * of corruption detected here (for !readonly callers), in which case the
+ * user-visible diagnostic message is misleading.  The checks performed only
+ * for readonly callers might more accurately frame the problem as a bogus leaf
+ * page transposition, or a cross-page invariant not holding due to recovery
+ * not replaying all WAL records.  That's why the !readonly ERROR message
+ * raised here includes a HINT about trying the other variant out.
+ */
+static void
+bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
+						  bool *isnull, bool tupleIsAlive, void *checkstate)
+{
+	BtreeCheckState *state = (BtreeCheckState *) checkstate;
+	IndexTuple		 itup;
+
+	Assert(state->heapallindexed);
+
+	/* Must recheck visibility when only AccessShareLock held */
+	if (!state->readonly)
+	{
+		TransactionId	xmin;
+
+		/*
+		 * Don't test for presence in index where xmin not at least old enough
+		 * that we know for sure that absence of index tuple wasn't just due to
+		 * some transaction performing insertion after our verifying index
+		 * traversal began.  (Actually, the cut-off is based on the point
+		 * before which any possible inserting transaction must have
+		 * committed/aborted.)
+		 *
+		 * You might think that the fact that an MVCC snapshot is used by the
+		 * heap scan (due to indicating that this is the first scan of a CREATE
+		 * INDEX CONCURRENTLY index build) would make this test redundant.
+		 * That's not quite true, because with current IndexBuildHeapScan()
+		 * interface caller cannot do the MVCC snapshot acquisition itself.  In
+		 * this way, heap tuple coverage is similar to the coverage we could
+		 * get by acquiring the MVCC snapshot ourselves at the point where
+		 * GetOldestXmin() is currently called.  It's easier to do this than to
+		 * adopt the IndexBuildHeapScan() interface to our narrow requirements.
+		 */
+		xmin = HeapTupleHeaderGetXmin(htup->t_data);
+		if (!TransactionIdPrecedes(xmin, state->oldestxmin))
+			return;
+	}
+
+	/* Generate an index tuple */
+	itup = index_form_tuple(RelationGetDescr(index), values, isnull);
+	itup->t_tid = htup->t_self;
+
+	/* Probe Bloom filter -- tuple should be present */
+	if (bloom_lacks_element(state->filter, (unsigned char *) itup,
+							IndexTupleSize(itup)))
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("table \"%s\" lacks matching index tuple in index \"%s\" for tid (%u,%u)",
+						RelationGetRelationName(state->heaprel),
+						RelationGetRelationName(state->rel),
+						ItemPointerGetBlockNumber(&(itup->t_tid)),
+						ItemPointerGetOffsetNumber(&(itup->t_tid))),
+				 !state->readonly ?
+				 errhint("Calling bt_index_parent_check() against target \"%s\" may further isolate the inconsistency",
+						 RelationGetRelationName(state->rel)) : 0 ));
+
+	state->heaptuplespresent++;
+	pfree(itup);
+}
+
+/*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
  *
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index dd71dbd..c071ead 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -44,7 +44,7 @@
   <variablelist>
    <varlistentry>
     <term>
-     <function>bt_index_check(index regclass) returns void</function>
+     <function>bt_index_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
      <indexterm>
       <primary>bt_index_check</primary>
      </indexterm>
@@ -55,7 +55,7 @@
       <function>bt_index_check</function> tests that its target, a
       B-Tree index, respects a variety of invariants.  Example usage:
 <screen>
-test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+test=# SELECT bt_index_check(index =&gt; c.oid, heapallindexed =&gt; i.indisprimary)
 FROM pg_index i
 JOIN pg_opclass op ON i.indclass[0] = op.oid
 JOIN pg_am am ON op.opcmethod = am.oid
@@ -83,20 +83,23 @@ ORDER BY c.relpages DESC LIMIT 10;
 </screen>
       This example shows a session that performs verification of every
       catalog index in the database <quote>test</>.  Details of just
-      the 10 largest indexes verified are displayed.  Since no error
-      is raised, all indexes tested appear to be logically consistent.
-      Naturally, this query could easily be changed to call
+      the 10 largest indexes verified are displayed.  Verification of
+      the presence of heap tuples as index tuples is requested for
+      primary key indexes only.  Since no error is raised, all indexes
+      tested appear to be logically consistent.  Naturally, this query
+      could easily be changed to call
       <function>bt_index_check</function> for every index in the
       database where verification is supported.
      </para>
      <para>
-      <function>bt_index_check</function> acquires an <literal>AccessShareLock</>
-      on the target index and the heap relation it belongs to. This lock mode
-      is the same lock mode acquired on relations by simple
-      <literal>SELECT</> statements.
+      <function>bt_index_check</function> acquires an
+      <literal>AccessShareLock</> on the target index and the heap
+      relation it belongs to.  This lock mode is the same lock mode
+      acquired on relations by simple <literal>SELECT</> statements.
       <function>bt_index_check</function> does not verify invariants
-      that span child/parent relationships, nor does it verify that
-      the target index is consistent with its heap relation.  When a
+      that span child/parent relationships, but will verify the
+      presence of all heap tuples as index tuples within the index
+      when <parameter>heapallindexed</> is <literal>true</>.  When a
       routine, lightweight test for corruption is required in a live
       production environment, using
       <function>bt_index_check</function> often provides the best
@@ -108,7 +111,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean DEFAULT false) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -117,19 +120,22 @@ ORDER BY c.relpages DESC LIMIT 10;
     <listitem>
      <para>
       <function>bt_index_parent_check</function> tests that its
-      target, a B-Tree index, respects a variety of invariants.  The
-      checks performed by <function>bt_index_parent_check</function>
-      are a superset of the checks performed by
-      <function>bt_index_check</function>.
-      <function>bt_index_parent_check</function> can be thought of as
-      a more thorough variant of <function>bt_index_check</function>:
-      unlike <function>bt_index_check</function>,
+      target, a B-Tree index, respects a variety of invariants.
+      Optionally, when the <parameter>heapallindexed</> argument is
+      <literal>true</>, the function verifies the presence of all heap
+      tuples that should be found within the index.  The checks
+      performed by <function>bt_index_parent_check</function> are a
+      superset of the checks performed by
+      <function>bt_index_check</function> when called with the same
+      options.  <function>bt_index_parent_check</function> can be
+      thought of as a more thorough variant of
+      <function>bt_index_check</function>: unlike
+      <function>bt_index_check</function>,
       <function>bt_index_parent_check</function> also checks
-      invariants that span parent/child relationships.  However, it
-      does not verify that the target index is consistent with its
-      heap relation.  <function>bt_index_parent_check</function>
-      follows the general convention of raising an error if it finds a
-      logical inconsistency or other problem.
+      invariants that span parent/child relationships.
+      <function>bt_index_parent_check</function> follows the general
+      convention of raising an error if it finds a logical
+      inconsistency or other problem.
      </para>
      <para>
       A <literal>ShareLock</> is required on the target index by
@@ -159,6 +165,70 @@ ORDER BY c.relpages DESC LIMIT 10;
  </sect2>
 
  <sect2>
+  <title>Optional <parameter>heapallindexed</> verification</title>
+ <para>
+  When the <parameter>heapallindexed</> argument to verification
+  functions is <literal>true</>, an additional phase of verification
+  is performed against the table associated with the target index
+  relation.  This consists of a <quote>dummy</> <command>CREATE
+  INDEX</> operation, which checks for the presence of all would-be
+  new index tuples against a temporary, in-memory summarizing
+  structure (this is built when needed during the first, standard
+  phase).  The summarizing structure <quote>fingerprints</> every
+  tuple found within the target index.  The high level principle
+  behind <parameter>heapallindexed</> verification is that a new index
+  that is equivalent to the existing, target index must only have
+  entries that can be found in the existing structure.
+ </para>
+ <para>
+  The additional <parameter>heapallindexed</> phase adds significant
+  overhead: verification will typically take several times longer than
+  it would with only the standard consistency checking of the target
+  index's structure.  However, verification will still take
+  significantly less time than an actual <command>CREATE INDEX</>.
+  There is no change to the relation-level locks acquired when
+  <parameter>heapallindexed</> verification is performed.  The
+  summarizing structure is bound in size by
+  <varname>maintenance_work_mem</varname>.  In order to ensure that
+  there is no more than a 2% probability of failure to detect the
+  absence of any particular index tuple, approximately 2 bytes of
+  memory are needed per index tuple.  As less memory is made available
+  per index tuple, the probability of missing an inconsistency
+  increases.  This is considered an acceptable trade-off, since it
+  limits the overhead of verification very significantly, while only
+  slightly reducing the probability of detecting a problem, especially
+  for installations where verification is treated as a routine
+  maintenance task.
+ </para>
+ <para>
+  In many applications, even the default
+  <varname>maintenance_work_mem</varname> setting of <literal>64MB</>
+  will be sufficient to have less than a 2% probability of overlooking
+  any single absent or corrupt tuple.  This will be the case when
+  there are no indexes with more than about 30 million distinct index
+  tuples, regardless of the overall size of any index, the total
+  number of indexes, or anything else.  False positive candidate tuple
+  membership tests within the summarizing structure occur at random,
+  and are very unlikely to be the same for repeat verification
+  operations.  Furthermore, within a single verification operation,
+  each missing or malformed index tuple independently has the same
+  chance of being detected.  If there is any inconsistency at all, it
+  isn't particularly likely to be limited to a single tuple.  All of
+  these factors favor accepting a limited per operation per tuple
+  probability of missing corruption, in order to enable performing
+  more thorough index to heap verification more frequently (practical
+  concerns about the overhead of verification are likely to limit the
+  frequency of verification).  In aggregate, the probability of
+  detecting a hardware fault or software defect actually
+  <emphasis>increases</> significantly with this strategy in most real
+  world cases.  Moreover, frequent verification allows problems to be
+  caught earlier on average, which helps to limit the overall impact
+  of corruption, and often simplifies root cause analysis.
+ </para>
+
+ </sect2>
+
+ <sect2>
   <title>Using <filename>amcheck</> effectively</title>
 
  <para>
@@ -199,17 +269,31 @@ ORDER BY c.relpages DESC LIMIT 10;
    </listitem>
    <listitem>
     <para>
+     Structural inconsistencies between indexes and the heap relations
+     that are indexed (when <parameter>heapallindexed</> verification
+     is performed).
+    </para>
+    <para>
+     There is no cross-checking of indexes against their heap relation
+     during normal operation.  Symptoms of heap corruption can be very
+     subtle.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
      Corruption caused by hypothetical undiscovered bugs in the
-     underlying <productname>PostgreSQL</> access method code or sort
-     code.
+     underlying <productname>PostgreSQL</> access method code, sort
+     code, or transaction management code.
     </para>
     <para>
      Automatic verification of the structural integrity of indexes
      plays a role in the general testing of new or proposed
      <productname>PostgreSQL</> features that could plausibly allow a
-     logical inconsistency to be introduced.  One obvious testing
-     strategy is to call <filename>amcheck</> functions continuously
-     when running the standard regression tests.  See <xref
+     logical inconsistency to be introduced.  Verification of table
+     structure and associated visibility and transaction status
+     information plays a similar role.  One obvious testing strategy
+     is to call <filename>amcheck</> functions continuously when
+     running the standard regression tests.  See <xref
      linkend="regress-run"> for details on running the tests.
     </para>
    </listitem>
@@ -242,6 +326,13 @@ ORDER BY c.relpages DESC LIMIT 10;
      <emphasis>absolute</emphasis> protection against failures that
      result in memory corruption.
     </para>
+    <para>
+     When <parameter>heapallindexed</> is <literal>true</>, and heap
+     verification is performed, there is generally a greatly increased
+     chance of detecting single-bit errors, since strict binary
+     equality is tested, and the indexed attributes within the heap
+     are tested.
+    </para>
    </listitem>
   </itemizedlist>
   In general, <filename>amcheck</> can only prove the presence of
-- 
2.7.4

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] A design for amcheck heapam verification

Reply via email to