It seems like a good next step for amcheck would be to add functionality that verifies that heap tuples have matching index tuples, and that heap pages are generally sane. I've been thinking about a design for this for a while now, and would like to present some tentative ideas before I start writing code.
Using a bloom filter built from hashed index tuples, and then matching that against the heap, is an approach to index/heap consistency checking that has lots of upsides, but at least one downside. The downside is pretty obvious: we introduce a low though quantifiable risk of false negatives (failure to detect corruption due to the actual absence of an appropriate index tuple). That's not great, especially because it occurs non-deterministically, but the upsides to this approach are considerable. Perhaps it's worth it.

Here is the general approach that I have in mind right now:

* Size the bloom filter based on the pg_class.reltuples of the index to be verified, weighing a practical tolerance for false negatives, and capping at work_mem.

  - We might throw an error immediately if it's impossible to get a reasonably low probability of false negatives -- for some value of "reasonable".

* Perform the existing verification checks for a B-Tree. While scanning the index, hash index tuples, including their heap TID pointer, and build the bloom filter with the hashed values as we go.

As we scan the heap, we:

* Verify that HOT safety was correctly assessed with respect to the index (or indexes) being verified.

* Test the visibility map, and sanity check MultiXacts [1].

* Perform the index/heap match check (which probes the bloom filter). If a heap tuple meets the following conditions:

  - It is not a HOT update tuple.
  - It is committed, and committed before RecentGlobalXmin.
  - It satisfies any predicate (for the partial index case).

  Then:

  - Build a would-be index tuple value, perhaps reusing CREATE INDEX code.
  - Hash that in the same manner as in the original index pass.
  - Raise an error if the bloom filter indicates it is not present in the index.

It seems like we should definitely go to the index first, because the heap is where we'll find visibility information.

If I wanted to implement the same index/heap checks in a way that does not have the notable downside around false negatives, I suppose I'd have to invent a new, internal mechanism that performed a kind of special merge join of an index against the heap. That would be complicated, and require a lot of code; in other words, it would be bug-prone. I wouldn't say that amcheck is simple today, but at least its complexity owes everything to how B-Trees already work, as opposed to how a bunch of custom infrastructure that we had to invent works. The amount of code in amcheck's verify_nbtree.c file is pretty low, and that's a good goal to stick with. The very detailed comment that argues for the correctness of amcheck's bt_right_page_check_scankey() function is, to a large degree, also arguing for the correctness of a bunch of code within nbtree. amcheck verification should be comprehensive, but also simple and minimal, which IMV is a good idea for about the same reason that simplicity is a good idea when writing unit tests.

The merge join style approach would also make verification quite expensive, particularly when one table has multiple indexes. A tool whose overhead can only really be justified when we're pretty sure that there is already a problem is *significantly* less useful. And it ties verification to the executor, which could become a problem if we make the functionality into a library that is usable by backup tools that don't want to go through the buffer manager (or any SQL interface). Apart from the low memory overhead of using a bloom filter, resource management is itself made a lot easier.
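To make the mechanics concrete, here is a minimal, self-contained C sketch of the pieces described above: sizing the bitmap from an estimated tuple count and a tolerated false positive rate, hashing with a per-verification seed (the random seed idea is discussed further down), an "add" path for the index scan, and a "probe" path for the heap scan. The names, the FNV-1a hash, and the double-hashing scheme are all illustrative assumptions for this sketch only; real amcheck code would presumably reuse existing hashing infrastructure, allocate in an appropriate memory context, and cap the bitmap at work_mem, as described above.

/*
 * Standalone sketch only -- not amcheck code.  Illustrates sizing a bloom
 * filter from an estimated tuple count, and adding/probing seeded hashes
 * of a normalized index tuple representation.
 */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct BloomFilter
{
    uint64_t   *bits;       /* bitmap */
    uint64_t    nbits;      /* number of bits in bitmap (m) */
    int         nhashes;    /* number of hash functions (k) */
    uint64_t    seed;       /* per-verification random seed */
} BloomFilter;

/* 64-bit FNV-1a, with the caller's seed folded into the offset basis */
static uint64_t
hash_bytes(const unsigned char *data, size_t len, uint64_t seed)
{
    uint64_t    h = UINT64_C(0xcbf29ce484222325) ^ seed;

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= UINT64_C(0x100000001b3);
    }
    return h;
}

/*
 * Size the filter for an expected number of elements (e.g. the index's
 * pg_class.reltuples) and a tolerated false positive rate.  A real
 * implementation would additionally cap nbits based on work_mem, and
 * possibly error out if the resulting rate becomes unreasonable.
 */
static BloomFilter *
bloom_create(double expected_elems, double fp_rate, uint64_t seed)
{
    BloomFilter *bf = malloc(sizeof(BloomFilter));
    /* standard formulas: m = -n * ln(p) / ln(2)^2, k = (m / n) * ln(2) */
    double      nbits = -expected_elems * log(fp_rate) / (log(2) * log(2));

    if (nbits < 64)
        nbits = 64;
    bf->nbits = (uint64_t) nbits;
    bf->nhashes = (int) round((nbits / expected_elems) * log(2));
    if (bf->nhashes < 1)
        bf->nhashes = 1;
    if (bf->nhashes > 32)
        bf->nhashes = 32;
    bf->seed = seed;
    bf->bits = calloc((bf->nbits + 63) / 64, sizeof(uint64_t));
    return bf;
}

/* derive the i-th probe position via double hashing */
static uint64_t
bloom_pos(const BloomFilter *bf, const unsigned char *key, size_t len, int i)
{
    uint64_t    h1 = hash_bytes(key, len, bf->seed);
    uint64_t    h2 = hash_bytes(key, len, bf->seed ^ UINT64_C(0x9e3779b97f4a7c15));

    return (h1 + (uint64_t) i * h2) % bf->nbits;
}

/* index scan: add the hash of one normalized index tuple (values + TID) */
static void
bloom_add(BloomFilter *bf, const unsigned char *key, size_t len)
{
    for (int i = 0; i < bf->nhashes; i++)
    {
        uint64_t    pos = bloom_pos(bf, key, len, i);

        bf->bits[pos / 64] |= UINT64_C(1) << (pos % 64);
    }
}

/*
 * Heap scan: probe with the hash of a would-be index tuple.  Returns 0 only
 * when the tuple is definitely absent from the index, i.e. corruption.
 */
static int
bloom_probe(const BloomFilter *bf, const unsigned char *key, size_t len)
{
    for (int i = 0; i < bf->nhashes; i++)
    {
        uint64_t    pos = bloom_pos(bf, key, len, i);

        if (!(bf->bits[pos / 64] & (UINT64_C(1) << (pos % 64))))
            return 0;
    }
    return 1;
}

(Compile with -lm for the math functions.) Note that the filter's false positives are exactly what surface as the checker's false negatives: a heap tuple with no matching index tuple goes undetected whenever all of its probe bits happen to have been set by other tuples' hashes, which is why the bitmap has to be sized with a practical tolerance for that probability in mind.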
We won't be sensitive to misestimations, because we only need an estimate of the number of tuples in the index, which will tend to be accurate enough in the vast majority of cases. reltuples is needed to size the bloom filter bitmap up front. It doesn't matter how wide individual index tuples turn out to be, because we simply hash everything, including even the heap TID contained within the index tuple.

Using a bloom filter makes verification "stackable" in a way that might become important later. For example, we can later improve amcheck to verify a table in parallel, by having parallel workers verify one index each, with bloom filters built in fixed-size shared memory buffers. A parallel heap scan then has workers concurrently verify heap pages, and concurrently probe each final bloom filter. Similarly, everything works the same if we add the option of scanning a B-Tree in physical order (at the expense of not doing cross-page verification).

And, while I'd start with nbtree, you can still pretty easily generalize the approach of building the bloom filter across AMs. All index AMs other than BRIN have index tuples that are essentially some values that are either the original heap values themselves, or values that are a function of those original values, plus a heap TID pointer. So, GIN might compress the TIDs and deduplicate the values, and SP-GiST might use its own compression, but it's pretty easy to build a conditioned IndexTuple binary string that normalizes away these differences, and that is what we actually hash. When WARM introduces the idea of a RED or BLUE tuple, it may be possible to add that to the conditioning logic. (BTW, the more advanced requirements are why I think that at least some parts of amcheck should eventually end up in core -- most of these are a long way off, but they seem worth thinking about now.)

We can add a random seed value to the mix when hashing, so that any problem that is masked by hash collisions won't stay masked on repeat verification attempts. Errors from corruption display this value, which lets users reliably recreate the original error by explicitly setting the same seed the next time around. That matters when, say, users need to verify with total confidence that a particular issue has been fixed within a very large table in production.

I'd like to hear feedback on the general idea, and on what the user-visible interface ought to look like. The non-deterministic false negatives may need to be considered by the user-visible interface, which is the main reason I mention them.

[1] postgr.es/m/20161017014605.ga1220...@tornado.leadboat.com

--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/