> On May 13, 2020, at 3:29 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>
> On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvhe...@2ndquadrant.com>
> wrote:
>> Hmm. I think we should (try to?) write code that avoids all crashes
>> with production builds, but not extend that to assertion failures.
>
> Assertions are only a problem at all because Mark would like to write
> tests that involve a selection of truly corrupt data. That's a new
> requirement, and one that I have my doubts about.
>
>>> I'll stick with your example. You're calling
>>> TransactionIdDidCommit() from check_tuphdr_xids(), which will
>>> interrogate the commit log and pg_subtrans. It's just not under your
>>> control.
>>
>> in a production build this would just fail with an error that the
>> pg_xact file cannot be found, which is fine -- if this happens in a
>> production system, you're not disturbing any other sessions. Or maybe
>> the file is there and the byte can be read, in which case you would get
>> the correct response; but that's fine too.
>
> I think that this is fine, too, since I don't consider assertion
> failures with corrupt data all that important. I'd make some effort to
> avoid it, but not too much, and not at the expense of a useful general
> purpose assertion that could catch bugs in many different contexts.
I am not removing any assertions, nor do I propose to remove any. When I talk
about "hardening against assertions", that is not in any way a proposal to
remove assertions from the code. What I'm talking about is writing the amcheck
contrib module code in such a way that it only calls a function that could
assert on bad data after first checking that the data is not bad.
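To make that concrete, here is roughly the shape of what I mean. This is only
a sketch, not code from the patch: the function name is invented, and the
particular range checks are illustrative rather than what the patch does.

    #include "postgres.h"
    #include "access/transam.h"

    /*
     * Sketch only: decide whether it is safe to ask the clog about this
     * xid at all before calling TransactionIdDidCommit(), which is
     * entitled to assert (or error) on nonsense input.
     */
    static bool
    xid_commit_status_safe(TransactionId xid, bool *committed)
    {
        /*
         * Permanent xids have fixed semantics; no clog lookup needed.
         * (Simplification: this treats only FrozenTransactionId as
         * committed, ignoring BootstrapTransactionId.)
         */
        if (!TransactionIdIsNormal(xid))
        {
            *committed = TransactionIdEquals(xid, FrozenTransactionId);
            return true;
        }

        /*
         * A corrupt tuple can carry an xid from the "future", for which
         * no clog page need exist; refuse to look such an xid up.
         */
        if (TransactionIdFollowsOrEquals(xid, ReadNewTransactionId()))
            return false;

        *committed = TransactionIdDidCommit(xid);
        return true;
    }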
I don't know that hardening against assertions in this manner is worth doing,
but that is nonetheless what I'm talking about. You have made decent arguments
that it probably isn't worth doing for the btree checking code. In any event,
it is probably something that could be addressed in a future patch, after this
patch is committed.
There is a separate but related question in the offing about whether the
backend code, independently of any amcheck contrib stuff, should be more
paranoid in how it processes tuples, so as to check for corruption. The heap
tuple deforming code in question is on a pretty hot code path, and I don't
know that folks would accept the performance hit of doing more checks in that
part of the system; but that's pretty far from relevant to this patch. It
should be hashed out, or not, at some other time on some other thread.
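(For concreteness, the sort of check I mean there is nothing fancier than the
below. The helper is invented purely for illustration, and I am not proposing
it in this patch:)

    #include "postgres.h"
    #include "access/htup_details.h"

    /*
     * Illustration only: the kind of bounds check a more paranoid
     * tuple-deforming loop might perform before trusting a fixed-width
     * attribute's offset within the tuple.
     */
    static inline void
    check_attr_in_bounds(HeapTuple tup, int attnum, uint32 off, int32 attlen)
    {
        if (attlen >= 0 && off + (uint32) attlen > tup->t_len)
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("attribute %d at offset %u overruns tuple length %u",
                            attnum, off, tup->t_len)));
    }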
> I would be willing to make a larger effort to avoid crashing a
> backend, since that affects production. I might go to some effort to
> not crash with downright adversarial inputs, for example. But it seems
> inappropriate to take extreme measures just to avoid a crash with
> extremely contrived inputs that will probably never occur.
I think this is a misrepresentation of the tests that I've been running. There
are two kinds of tests that I have done:
First, there are the regression tests, t/004_verify_heapam.pl, which are
obviously contrived. They were included in the regression test suite because
they needed to be something other developers could read, verify ("yeah, I can
see why that would be corruption, and why it would give an error message of
the sort the test expects"), and then run to confirm that the expected error
message was indeed generated.
The second kind of corruption test I have been running is nothing more than
writing random nonsense into randomly chosen locations within heap files and
then running verify_heapam against those heap relations. It's much more Murphy
than Machiavelli when it's just generated by calling random(). When I
initially did this kind of testing, the heapam checking code had lots of
problems. Now it doesn't. There's very little contrived about that, as far as
I can see. It's the kind of corruption you'd expect from any number of faulty
storage systems. The one "contrived" aspect of my testing in this regard is
that the script I use to write random nonsense to random locations in heap
files is smart enough not to write random junk to the page headers. That is
because if I corrupt a page header, the backend never even gets as far as
running verify_heapam: the buffer manager rejects the page when its header
fails validation.
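The essential logic of that script is nothing more exotic than the following.
This is a simplified stand-in rather than the script itself:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/stat.h>

    #define BLCKSZ          8192    /* default PostgreSQL block size */
    #define PAGE_HEADER_SZ  24      /* SizeOfPageHeaderData, typically */

    /* Scribble random bytes at random offsets in a heap file, skipping
     * the page header at the start of each block so the pages still
     * pass header validation when read into shared buffers. */
    int
    main(int argc, char **argv)
    {
        int         fd;
        struct stat st;

        if (argc != 2 || (fd = open(argv[1], O_RDWR)) < 0 ||
            fstat(fd, &st) < 0 || st.st_size == 0)
        {
            fprintf(stderr, "usage: corrupt <heap file>\n");
            return 1;
        }
        srandom(time(NULL));
        for (int i = 0; i < 100; i++)
        {
            off_t   pos = random() % st.st_size;
            char    junk = (char) random();

            /* leave the page headers alone, as described above */
            if (pos % BLCKSZ < PAGE_HEADER_SZ)
                continue;
            (void) pwrite(fd, &junk, 1, pos);
        }
        close(fd);
        return 0;
    }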
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company