It seems like a good next step for amcheck would be to add
functionality that verifies that heap tuples have matching index
tuples, and that heap pages are generally sane. I've been thinking
about a design for this for a while now, and would like to present
some tentative ideas before I start writing code.

Building a bloom filter from hashed index tuples, and then matching the
heap against it, is an approach to index/heap consistency checking that
has lots of upsides, but at least one downside. The downside is pretty
obvious: we introduce a low though quantifiable risk of false negatives,
that is, of failing to report corruption when an appropriate index tuple
is actually absent. That's not great, especially because it occurs
non-deterministically, but the upsides to this approach are
considerable. Perhaps it's worth it.

Here is the general approach that I have in mind right now:

* Size the bloom filter based on the pg_class.reltuples of the index
to be verified, weighing a practical tolerance for false negatives, and
capping the allocation at work_mem (a rough sizing sketch appears after
this list).
  - We might throw an error immediately if it's impossible to get a
reasonably low probability of false negatives -- for some value of
"reasonable".
* Perform existing verification checks for a B-Tree. While scanning
the index, hash index tuples, including their heap TID pointer. Build
the bloom filter with hashed values as we go.
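
To make the sizing trade-off concrete, here's a rough sketch of how the
bitmap might be sized (illustrative only -- the constants and names like
choose_bloom_size are placeholders I just made up, not settled amcheck
interfaces):

#include <math.h>

#define TARGET_FP_RATE  0.01    /* filter false positive rate we aim for */
#define MAX_FP_RATE     0.10    /* refuse to run beyond this (made-up cutoff) */

/* optimal bitmap size in bits for n elements at false positive rate p */
static double
bloom_size_bits(double n, double p)
{
    return -n * log(p) / (M_LN2 * M_LN2);
}

/* false positive rate actually achieved with m bits, n elements, k hashes */
static double
bloom_fp_rate(double m, double n, int k)
{
    return pow(1.0 - exp(-k * n / m), k);
}

/* returns bitmap size in bytes, or 0 when no acceptable rate is possible */
static size_t
choose_bloom_size(double reltuples, size_t work_mem_bytes)
{
    double  cap = (double) work_mem_bytes * 8.0;
    double  m;
    int     k;

    if (reltuples < 1.0)
        reltuples = 1.0;        /* stay defined for empty/never-analyzed rels */

    m = bloom_size_bits(reltuples, TARGET_FP_RATE);
    if (m > cap)
        m = cap;                /* the work_mem cap raises the miss rate */

    k = (int) rint((m / reltuples) * M_LN2);
    if (k < 1)
        k = 1;

    if (bloom_fp_rate(m, reltuples, k) > MAX_FP_RATE)
        return 0;               /* caller throws an error immediately */

    return (size_t) (m / 8.0);
}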

As we scan the heap, we:

* Verify that HOT safety was correctly assessed with respect to the
index (or indexes) being verified.
* Test the visibility map, and sanity check MultiXacts [1].
* Perform the index/heap match check, which probes the bloom filter
(see the sketch after this list):

   If a heap tuple meets the following conditions:

 - Is not a HOT update tuple.
 - Is committed, and committed before RecentGlobalXmin.
 - Satisfies any predicate (for partial index case).

   Then:

  - Build a would-be index tuple value, perhaps reusing CREATE INDEX code.
  - Hash that in the same manner as in the original index pass.
  - Raise an error if the bloom filter indicates it is not present in the index.
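
In code, the per-tuple probe might look roughly like the following
sketch. Every helper named here (tuple_qualifies,
form_normalized_index_tuple, hash_normalized_tuple, bloom_contains,
report_corruption) is a hypothetical stand-in for a step listed above,
not an existing function:

/* Sketch of the heap-side probe against the bloom filter */
static void
check_heap_tuple(HeapTuple htup, BloomFilter *filter, uint64 seed)
{
    IndexTuple  itup;
    uint64      hash;

    /*
     * Skip tuples the index is not required to contain: HOT update
     * tuples, tuples that did not commit before RecentGlobalXmin, and
     * tuples that do not satisfy a partial index's predicate.
     */
    if (!tuple_qualifies(htup))
        return;

    /*
     * Build the would-be index tuple, much as CREATE INDEX would,
     * including its heap TID, and hash it exactly as the earlier index
     * scan pass did.
     */
    itup = form_normalized_index_tuple(htup);
    hash = hash_normalized_tuple(itup, seed);

    /* A definite "no" from the bloom filter means an index tuple is missing */
    if (!bloom_contains(filter, hash))
        report_corruption(htup, seed);
}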

Seems like we should definitely go to the index first, and the heap
second, because the heap is where we'll find the visibility information
needed to decide which tuples ought to have matching index tuples.

If I wanted to go about implementing the same index/heap checks in a
way that does not have the notable downside around false negatives, I
suppose I'd have to invent a new, internal mechanism that performed a
kind of special merge join of an index against the heap. That would be
complicated, and require a lot of code; in other words, it would be
bug-prone. I wouldn't say that amcheck is simple today, but at least
the complexity owes everything to how B-Trees already work, as opposed
to how a bunch of custom infrastructure we had to invent works. The
amount of code in amcheck's verify_nbtree.c file is pretty low, and
keeping it that way is a goal worth sticking with. The very detailed
comment that
argues for the correctness of amcheck's bt_right_page_check_scankey()
function is, to a large degree, also arguing for the correctness of a
bunch of code within nbtree. amcheck verification should be
comprehensive, but also simple and minimal, which IMV is a good idea
for much the same reason that it's a good idea when writing unit
tests.

The merge join style approach would also make verification quite
expensive, particularly when one table has multiple indexes. A tool
whose overhead can only really be justified when we're pretty sure
that there is already a problem is *significantly* less useful. And,
it ties verification to the executor, which could become a problem if
we make the functionality into a library that is usable by backup
tools that don't want to go through the buffer manager (or any SQL
interface).

Apart from the low memory overhead of using a bloom filter, resource
management is itself made a lot easier. We won't be sensitive to
misestimations, because we only need an estimate of the number of
tuples in the index, which will tend to be accurate enough in the vast
majority of cases. reltuples is needed to size the bloom filter bitmap
up-front. It doesn't matter how wide individual index tuples turn out
to be, because we simply hash everything, including even the heap TID
contained within the index.
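
To put rough numbers on that, using the standard bloom filter sizing
formula: an index with reltuples = 10 million and a target false
positive rate of 1% needs a bitmap of roughly

    m = -n * ln(p) / (ln 2)^2
      = 10,000,000 * 4.605 / 0.4805
      ~ 95.8 million bits, or about 11.4 MiB

which is modest even for a fairly large index; if work_mem forces a
smaller bitmap, the false positive rate rises in a way we can compute
and report up-front.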

Using a bloom filter makes verification "stackable" in a way that
might become important later. For example, we can later improve
amcheck to verify a table in parallel, by having parallel workers each
verify one index, with bloom filters built in fixed-size shared
memory buffers. A parallel heap scan then has workers concurrently
verify heap pages, and concurrently probe each final bloom filter.
Similarly, everything works the same if we add the option of scanning
a B-Tree in physical order (at the expense of not doing cross-page
verification). And, while I'd start with nbtree, you can still pretty
easily generalize the approach to building the bloom filter across
AMs. All index AMs other than BRIN have index tuples that are
essentially some values that are either the original heap values
themselves, or values that are a function of those original values,
plus a heap TID pointer. So, GIN might compress the TIDs, and
deduplicate the values, and SP-GiST might use its own compression, but
it's pretty easy to build a conditioned IndexTuple binary string that
normalizes away these differences, which is what we actually hash.
When WARM introduces the idea of a RED or BLUE tuple, it may be
possible to add that to the conditioning logic.

(BTW, the more advanced requirements are why I think that at least
some parts of amcheck should eventually end up in core -- most of
these are a long way off, but seem worth thinking about now.)

We can add a random seed value to the mix when hashing, so that any
problem that is masked by hash collisions won't stay masked on repeat
verification attempts. Errors from corruption display this value,
which lets users reliably recreate the original error by explicitly
setting the seed the next time around -- say, when they need to verify
with total confidence that a particular issue has been fixed within a
very large table in production.
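
As a minimal sketch of what seeded hashing could look like (FNV-1a is
purely a stand-in here; the actual hash function, and exactly how the
normalized index tuple bytes get produced, is left open):

#include <stddef.h>
#include <stdint.h>

/*
 * Hash the normalized index tuple bytes (values plus heap TID), mixing
 * in a per-run seed so that a collision that masks corruption in one
 * run is unlikely to mask it again in the next.
 */
static uint64_t
hash_tuple_bytes(const unsigned char *buf, size_t len, uint64_t seed)
{
    uint64_t    h = UINT64_C(14695981039346656037) ^ seed;  /* FNV-1a basis */

    for (size_t i = 0; i < len; i++)
    {
        h ^= buf[i];
        h *= UINT64_C(1099511628211);   /* FNV-1a 64-bit prime */
    }
    return h;
}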

I'd like to hear feedback on the general idea, and what the
user-visible interface ought to look like. The non-deterministic false
negatives may need to be considered by the user-visible interface,
which is the main reason I mention them.

[1] postgr.es/m/20161017014605.ga1220...@tornado.leadboat.com

-- 
Peter Geoghegan
