> On May 13, 2020, at 3:29 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> 
> On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvhe...@2ndquadrant.com> 
> wrote:
>> Hmm.  I think we should (try to?) write code that avoids all crashes
>> with production builds, but not extend that to assertion failures.
> 
> Assertions are only a problem at all because Mark would like to write
> tests that involve a selection of truly corrupt data. That's a new
> requirement, and one that I have my doubts about.
> 
>>> I'll stick with your example. You're calling
>>> TransactionIdDidCommit() from check_tuphdr_xids(), which will
>>> interrogate the commit log and pg_subtrans. It's just not under your
>>> control.
>> 
>> in a production build this would just fail with an error that the
>> pg_xact file cannot be found, which is fine -- if this happens in a
>> production system, you're not disturbing any other sessions.  Or maybe
>> the file is there and the byte can be read, in which case you would get
>> the correct response; but that's fine too.
> 
> I think that this is fine, too, since I don't consider assertion
> failures with corrupt data all that important. I'd make some effort to
> avoid it, but not too much, and not at the expense of a useful general
> purpose assertion that could catch bugs in many different contexts.

I am not removing any assertions, nor do I propose to remove any.  When I talk 
about "hardening against assertions", that is not in any way a proposal to 
remove assertions from the code.  What I'm talking about is writing the amcheck 
contrib module code in such a way that it only calls a function that could 
assert on bad data after first checking that the data is not bad.
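
To make that concrete, here is a rough sketch (not code from the patch) of the 
sort of pre-check I mean before handing an xid from a tuple header to 
TransactionIdDidCommit().  The function name is made up, the relfrozenxid bound 
would come from the relation being checked, and ReadNextTransactionId() stands 
in for however one obtains the next unassigned xid:

    #include "postgres.h"
    #include "access/transam.h"

    /*
     * Sketch only: decide whether it is safe to ask the clog about an xid
     * found in a possibly-corrupt tuple header.  Garbage xids get reported
     * as corruption by the caller instead of being passed along to code
     * that might assert, or fail to find a pg_xact file.
     */
    static bool
    xid_is_checkable(TransactionId xid, TransactionId relfrozenxid)
    {
        /* Bootstrap/frozen/invalid xids never need a clog lookup. */
        if (!TransactionIdIsNormal(xid))
            return false;

        /* Older than relfrozenxid: its clog may already be truncated away. */
        if (TransactionIdPrecedes(xid, relfrozenxid))
            return false;

        /* From the future: clog has no page for it yet. */
        if (!TransactionIdPrecedes(xid, ReadNextTransactionId()))
            return false;

        return true;
    }

A caller that gets false back would report the xid itself as corruption rather 
than consult the clog at all.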

I don't know that hardening against assertions in this manner is worth doing, 
but it is nonetheless what I'm talking about.  You have made decent arguments 
that it probably isn't worth doing for the btree checking code.  In any event, 
it is something that could probably be addressed in a future patch, after this 
one is committed.

There is a separate but related question in the offing about whether the 
backend code, independently of any amcheck contrib stuff, should be more 
paranoid about checking for corruption as it processes tuples.  The tuple 
deforming code in question is on a pretty hot path, and I don't know that folks 
would accept the performance hit of doing more checks in that part of the 
system; but that's pretty far from relevant to this patch.  It should be hashed 
out, or not, at some other time on some other thread.
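
Purely as illustration of the sort of check I mean (hypothetical code, not the 
actual tuple deforming routines):

    #include "postgres.h"

    /*
     * Hypothetical hardening only, not proposed code: before reading a
     * fixed-length attribute at "off" within a tuple's data area of
     * "data_len" bytes, make sure it doesn't run off the end.
     */
    static inline void
    check_att_in_bounds(uint32 off, int16 attlen, uint32 data_len)
    {
        if (attlen > 0 && (uint64) off + (uint64) attlen > (uint64) data_len)
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("attribute of length %d at offset %u overruns tuple of %u bytes",
                            (int) attlen, off, data_len)));
    }

Even one extra branch per attribute is the kind of cost that would need to be 
justified on that path, which is why this belongs on another thread.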

> I would be willing to make a larger effort to avoid crashing a
> backend, since that affects production. I might go to some effort to
> not crash with downright adversarial inputs, for example. But it seems
> inappropriate to take extreme measures just to avoid a crash with
> extremely contrived inputs that will probably never occur.

I think this is a misrepresentation of the tests that I've been running.  There 
are two kinds of tests that I have done:

First, there is the regression test, t/004_verify_heapam.pl, which is obviously 
contrived.  It was included in the regression test suite because it needed to 
be something other developers could read, verify ("yeah, I can see why that 
would be corruption, and why it would give an error message of the sort the 
test expects"), and then run to confirm that the expected error message was 
indeed generated.

The second kind of corruption test I have been running is nothing more than 
writing random nonsense to randomly chosen locations within heap files and then 
running verify_heapam against those relations.  It's much more Murphy than 
Machiavelli when it's just generated by calling random().  When I initially did 
this kind of testing, the heapam checking code had lots of problems.  Now it 
doesn't.  There's very little contrived about that, as far as I can see; it's 
the kind of corruption you'd expect from any number of faulty storage systems.  
The one "contrived" aspect of my testing in this regard is that the script I 
use to write random nonsense to random locations in heap files is smart enough 
not to write junk over the page headers.  That is because if I corrupt a page 
header, the backend never even gets as far as running the verify_heapam 
functions: the page fails validation when it is read into shared buffers, 
before verify_heapam ever sees it.
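
For anyone who wants to reproduce that kind of testing, here is a rough 
standalone C equivalent of what my script does (the script itself is not C; 
this sketch assumes the default 8kB block size and a 24-byte page header):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BLCKSZ 8192            /* default PostgreSQL block size */
    #define PAGE_HEADER_SZ 24      /* SizeOfPageHeaderData on common builds */

    /*
     * Write n random bytes at random offsets within a heap file, skipping
     * the first PAGE_HEADER_SZ bytes of each page so that page headers
     * survive and verify_heapam actually gets a chance to run.
     */
    int
    main(int argc, char **argv)
    {
        FILE   *fp;
        long    fsize, npages;
        int     i, n;

        if (argc != 3)
        {
            fprintf(stderr, "usage: %s heapfile ncorruptions\n", argv[0]);
            return 1;
        }
        n = atoi(argv[2]);

        fp = fopen(argv[1], "r+b");
        if (fp == NULL)
        {
            perror("fopen");
            return 1;
        }
        fseek(fp, 0, SEEK_END);
        fsize = ftell(fp);
        npages = fsize / BLCKSZ;
        if (npages == 0)
        {
            fprintf(stderr, "file is smaller than one page\n");
            fclose(fp);
            return 1;
        }

        srandom((unsigned) time(NULL));
        for (i = 0; i < n; i++)
        {
            long page = random() % npages;
            long off  = PAGE_HEADER_SZ + random() % (BLCKSZ - PAGE_HEADER_SZ);
            unsigned char junk = (unsigned char) random();

            fseek(fp, page * BLCKSZ + off, SEEK_SET);
            fwrite(&junk, 1, 1, fp);
        }
        fclose(fp);
        return 0;
    }

Run it against a relation file with the server shut down, so the corrupted 
pages aren't masked by clean copies in shared buffers, then start the server 
and run verify_heapam on the table.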


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company