> On May 12, 2020, at 5:34 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> 
> On Mon, May 11, 2020 at 10:21 AM Mark Dilger
> <mark.dil...@enterprisedb.com> wrote:
>> 2) amcheck's btree checking functions have been refactored to be able to 
>> operate in two modes; the original mode in which all errors are reported via 
>> ereport, and a new mode for returning errors as rows from a set returning 
>> function.

Thank you yet again for reviewing.  I really appreciate the feedback!

> Somebody suggested that I make amcheck work in this way during its
> initial development. I rejected that idea at the time, though. It
> seems hard to make it work because the B-Tree index scan is a logical
> order index scan. It's quite possible that a corrupt index will have
> circular sibling links, and things like that. Making everything an
> error removes that concern. There are clearly some failures that we
> could just soldier on from, but the distinction gets rather blurred.

Ok, I take your point that the code cannot soldier on after the first error is 
returned.  I'll change that for v6 of the patch, moving on to the next relation 
after hitting the first corruption in any particular index.  Do you mind that I 
refactored the code to return the error rather than ereporting?  If it offends 
your sensibilities, I could rip that back out, at the expense of having to use 
try/catch logic in some other places.  I prefer to avoid the try/catch stuff, 
but I'm not going to put up a huge fuss.
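
To illustrate the two reporting modes under discussion, here is a minimal sketch; the names (BtreeCheckContext, report_corruption) are placeholders for illustration, not the actual patch code:

typedef struct BtreeCheckContext
{
	bool		report_via_ereport;	/* original mode: raise an error */
	bool		is_corrupt;			/* set when any corruption is found */
	Tuplestorestate *tupstore;		/* rows for the set-returning function */
	TupleDesc	tupdesc;
} BtreeCheckContext;

static void
report_corruption(BtreeCheckContext *ctx, BlockNumber blkno, const char *msg)
{
	if (ctx->report_via_ereport)
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("%s in block %u", msg, blkno)));
	else
	{
		Datum		values[2];
		bool		nulls[2] = {false, false};

		values[0] = Int64GetDatum((int64) blkno);
		values[1] = CStringGetTextDatum(msg);
		tuplestore_putvalues(ctx->tupstore, ctx->tupdesc, values, nulls);
		ctx->is_corrupt = true;
	}
}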

> I understand why you want to do it this way. It makes sense that the
> heap stuff would report all inconsistencies together, at the end. I
> don't think that that's really workable (or even desirable) in the
> case of B-Tree indexes, though. When an index is corrupt, the solution
> is always to do root cause analysis, to make sure that the issue does
> not recur, and then to REINDEX. There isn't really a question about
> doing data recovery of the index structure.

Yes, I agree that reindexing is the most sensible remedy.  I certainly have no 
plans to implement some pg_fsck_index type tool.  Even for tables, I'm not 
interested in creating such a tool. I just want a good tool for finding out 
what the nature of the corruption is, as that might make it easier to debug 
what went wrong.  It's not just for debugging production systems, but also for 
chasing down problems in half-baked code prior to release.

> Would it be possible to log the first B-Tree inconsistency, and then
> move on to the next high-level phase of verification? You don't have
> to throw an error, but it seems like a good idea for amcheck to still
> give up on further verification of the index.

Ok, good, it sounds like we're converging on the same idea.  I'm happy to do so.
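
For concreteness, a hypothetical caller-side sketch of that behavior, reusing the placeholder context from the sketch above; check_one_btree_index() is assumed to return as soon as it has reported its first inconsistency:

static void
check_btree_indexes(List *index_oids, BtreeCheckContext *ctx)
{
	ListCell   *cell;

	foreach(cell, index_oids)
	{
		Oid			indrelid = lfirst_oid(cell);

		/* Gives up on this index after the first corruption report. */
		check_one_btree_index(indrelid, ctx);

		/* Reset the flag and continue with the next index. */
		ctx->is_corrupt = false;
	}
}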

> The assertion failure that you reported happens because of a generic
> assertion made from _bt_compare(). It doesn't have anything to do with
> amcheck (you'll see the same thing from regular index scans), really.

Oh, I know that already.  I could see that easily enough in the backtrace.  But 
if you look at the way I implemented verify_heapam, you might notice this:

/*
 * check_tuphdr_xids
 *
 *  Determine whether tuples are visible for verification.  Similar to
 *  HeapTupleSatisfiesVacuum, but with critical differences.
 *
 *  1) Does not touch hint bits.  It seems imprudent to write hint bits
 *     to a table during a corruption check.
 *  2) Only makes a boolean determination of whether verification should
 *     see the tuple, rather than doing extra work for vacuum-related
 *     categorization.
 *
 *  The caller should already have checked that xmin and xmax are not out of
 *  bounds for the relation.
 */

The point is that when checking the table for corruption I avoid calling 
anything that might assert (or segfault, or whatever).  I was talking about 
refactoring the btree checking code to be similarly careful.
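
To make that concrete, here is a minimal sketch of the sort of visibility gate the comment describes; this is illustrative only, not the patch's actual check_tuphdr_xids(), and the xmax side is elided:

static bool
tuple_is_checkable(HeapTupleHeader tuphdr)
{
	TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr);

	/*
	 * Unlike HeapTupleSatisfiesVacuum(), never set hint bits here; only
	 * consult the existing infomask bits and the clog.
	 */
	if (!HeapTupleHeaderXminCommitted(tuphdr))
	{
		if (HeapTupleHeaderXminInvalid(tuphdr))
			return false;		/* inserter aborted; nothing to verify */
		if (TransactionIdIsInProgress(xmin))
			return false;		/* inserter still running; skip for now */
		if (!TransactionIdDidCommit(xmin))
			return false;		/* inserter aborted or crashed */
	}

	/* xmax handling elided; the real check also considers deletion. */
	return true;
}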

> I think that removing that assertion would be the opposite of
> hardening. Even if you removed it, the backend will still crash once
> you come up with a slightly more evil index tuple. Maybe *that* could
> be mostly avoided with widespread hardening; we could in principle
> perform cross-checks of varlena headers against the tuple or page
> layout at any point reachable from _bt_compare(). That seems like
> something that would have unacceptable overhead, because the cost
> would be imposed on everything. And even then you've only ameliorated
> the problem.

I think we may have different mental models of how this all works in practice.  
I am (or was) envisioning that the backend, during regular table and index 
scans, cannot afford to check for corruption at every step along the way, and 
therefore does not.  A corruption-checking tool, though, has a fundamentally 
different purpose, and can and should operate in a way that won't blow up when 
checking a corrupt relation.  It's the difference between a car designed to 
drive down the highway at high speed vs. a military vehicle designed to drive 
over a minefield with a guy on the front bumper scanning for landmines, the 
whole while going half a mile an hour.

I'm starting to infer from your comments that you see the landmine detection 
vehicle as also driving at high speed, detecting landmines on occasion by 
seeing them first, but frequently by failing to see them and just blowing up.
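
For what it's worth, here is an illustrative sketch of the kind of per-datum cross-check described above, validating a varlena header against the containing tuple.  It is not code from the patch, just an example of the check whose cost would otherwise be imposed on every datum:

static bool
varlena_fits_in_tuple(HeapTuple tup, char *attptr)
{
	char	   *tupend = (char *) tup->t_data + tup->t_len;

	/* Need at least the header byte(s) before trusting VARSIZE_ANY(). */
	if (attptr >= tupend)
		return false;

	/* The datum's claimed length must not run past the end of the tuple. */
	if (attptr + VARSIZE_ANY(attptr) > tupend)
		return false;

	return true;
}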

> Code like amcheck's PageGetItemIdCareful() goes further than the
> equivalent backend macro (PageGetItemId()) to avoid assertion failures
> and crashes with corrupt data. I doubt that it is practical to take it
> much further than that, though. It's subject to diminishing returns.

Ok.
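
For reference, a simplified sketch in the spirit of PageGetItemIdCareful(); this is illustrative rather than the actual amcheck code:

static ItemId
get_itemid_carefully(Page page, BlockNumber block, OffsetNumber offset)
{
	ItemId		itemid = PageGetItemId(page, offset);

	/*
	 * Refuse to hand back a line pointer whose tuple would overlap the
	 * btree special space, rather than letting a later Assert or memory
	 * access blow up.
	 */
	if (ItemIdGetOffset(itemid) + ItemIdGetLength(itemid) >
		BLCKSZ - MAXALIGN(sizeof(BTPageOpaqueData)))
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("line pointer points past end of tuple space in index block %u",
						block)));

	return itemid;
}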

> In general, _bt_compare() calls user-defined code that is usually
> written in C. This C code could in principle feel entitled to do any
> number of scary things when you corrupt the input data. The amcheck
> module's dependency on user-defined operator code is totally
> unavoidable -- it is the single source of truth for the nbtree checks.

I don't really understand this argument, since users with buggy user-defined 
operators are not the target audience.  But there's no point in arguing it, as 
I'm already resolved to take your advice and not harden the btree stuff any 
further.

> It boils down to this: I think that regression tests that run on the
> buildfarm and actually corrupt data are not practical, at least in the
> case of the index checks -- though probably in all cases. Look at the
> pageinspect "btree.out" test output file -- it's very limited, because
> we have to work around a bunch of implementation details. It's no
> accident that the bt_page_items() test shows a palindrome value in the
> data column (the value is "01 00 00 00 00 00 00 01"). That's an
> endianness workaround.

One of the delays in submitting the most recent version of the patch is that I 
was having trouble creating a reliable, portable btree-corrupting regression 
test.  Ultimately, I submitted v5 without one, as it proved pretty difficult to 
write a test good enough for submission, and I had already put a couple more 
days into developing v5 than I had intended.  So I can't argue too much with 
your point here.

I did, however, address (some of?) the issues that you and others raised about 
the table-corrupting regression test.  Perhaps remaining issues will show up on 
machines with different endianness than I have tested so far, but I don't 
expect them to be insurmountable.  Are you fundamentally opposed to that test 
framework?  If you're going to vote against committing the patch with that 
test, I'll back down and remove it, but it doesn't seem like a bad regression 
test to me.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




