On 16/03/2019 10:51, Peter Geoghegan wrote:
On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
It would be nice if you could take a look at the amcheck "relocate"
patch
When I started looking at this, I thought that "relocate" means "move".
So I thought that the new mode would actually move tuples, i.e. that it
would modify the index. That sounded horrible. Of course, it doesn't
actually do that. It just checks that each tuple can be re-found, or
"relocated", by descending the tree from the root. I'd suggest changing
the language to avoid that confusion.

Okay. What do you suggest? :-)

Hmm. "re-find", maybe? We use that term in a few error messages already, to mean something similar.

It seems like a nice way to catch all kinds of index corruption issues.
Although, we already check that the tuples are in order within the page.
Is it really necessary to traverse the tree for every tuple, as well?
Maybe do it just for the first and last item?

It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.

Fair enough.

At first, I thought this would be horrendously expensive, but thinking about it a bit more, nearby tuples will always follow the same path through the upper nodes, so it'll all be cached. So maybe it's not quite so bad.

I don't understand this. Can you give an example of this kind of
inconsistency?

The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).

Hmm. So, the initial situation would be something like this:

                 +-----------+
                 | 1: root   |
                 |           |
                 | -inf -> 2 |
                 | 20   -> 3 |
                 |           |
                 +-----------+

        +-------------+ +-------------+
        | 2: internal | | 3: internal |
        |             | |             |
        | -inf -> 4   | | -inf -> 6   |
        | 10   -> 5   | | 30   -> 7   |
        |             | |             |
        | hi: 20      | |             |
        +-------------+ +-------------+

+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
|         | |         | |         | |         |
| 1       | | 11      | | 21      | | 31      |
| 5       | | 15      | | 25      | | 35      |
| 9       | | 19      | | 29      | | 39      |
|         | |         | |         | |         |
| hi: 10  | | hi: 20  | | hi: 30  | |         |
+---------+ +---------+ +---------+ +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes the the leaf tuple 19 to become non-re-findable; if you descend the for 19, you'll incorrectly land on page 6. But it also causes the high key on page 2 to be different from the downlink on the root page. Wouldn't the existing checks catch this?

- Heikki

Reply via email to