On 05.09.2011 21:55, Bruce Momjian wrote:
Heikki Linnakangas wrote:
On 11.03.2011 19:41, Tom Lane wrote:
Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
On 11.03.2011 17:59, Tom Lane wrote:
But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.
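
To make the sequence above concrete, here is a toy model in Python. It is only a sketch and not PostgreSQL's recovery code: WalRecord, splits_finished_by_replay and the LSN values are hypothetical names and numbers chosen for illustration. It assumes that replay remembers every split it sees and forgets it again when the matching downlink insertion is replayed, so that whatever is left at the end is what the end-of-recovery cleanup would complete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int    # position of the record in the toy WAL
    kind: str   # "split" or "insert_downlink"
    page: int   # block number of the new right-hand page


def splits_finished_by_replay(wal, redo_lsn):
    """Return the pages whose incomplete split replay would notice and finish.

    Records before redo_lsn are never replayed, so a split logged before
    the redo pointer is invisible to this bookkeeping.
    """
    pending = set()
    for rec in sorted(wal, key=lambda r: r.lsn):
        if rec.lsn < redo_lsn:
            continue  # older than the checkpoint redo pointer: skipped
        if rec.kind == "split":
            pending.add(rec.page)
        elif rec.kind == "insert_downlink":
            pending.discard(rec.page)
    return pending


# The backend logged the split of page 7 and crashed before the downlink insert.
wal = [WalRecord(lsn=100, kind="split", page=7)]

# Redo pointer before the split record: replay sees the split and can finish it.
print(splits_finished_by_replay(wal, redo_lsn=50))

# Redo pointer after the split record (a checkpoint started and completed in
# between): replay never sees the split, so page 7 keeps its missing downlink.
print(splits_finished_by_replay(wal, redo_lsn=150))
```

In the second call nothing is pending, which is the case described in steps 1 to 5: the split is already on disk, but no one is left to insert the parent pointer.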

Hmm.  It's not so improbable that checkpoint would start inside that
window, but that the parent insertion is still pending by the time the
checkpoint finishes is pretty improbable.

How about just reducing the deletion-time ERROR for missing downlink to a LOG?

Well, the code that follows expects to have a valid parent page locked,
so you can't literally do just that. But yeah, LOG and aborting the page
deletion seems fine to me.

Did this get fixed?

Nope.

On a closer look, this isn't only a problem for page deletion. Page splitting also barfs if it can't find the parent of a page. As the code stands, a missing downlink is not harmless, but causes all sorts of trouble.

The window for this to happen with a checkpoint is extremely tight, but there's another situation where you can end up with a missing downlink: if you run out of disk space while splitting the parent page to make room for the new downlink.

I think we should apply a fix to b-tree similar to the one I did for GiST, and put a flag on pages with missing downlinks. Then we can fix the missing downlinks in vacuum and insertion, and get rid of the code to fix incomplete splits after WAL replay.

The way it would work is that on page split, the right page is flagged with a MISSING_DOWNLINK flag. When the downlink is inserted into the parent, the flag is cleared in the same critical section in which the WAL record for the parent insertion is written. Normally, a backend would never see the flag set, because the locks on the split pages are not released until the parent record is written and the flag is cleared again. But if inserting the downlink fails for any reason, the next inserter or vacuum that steps on the page can finish the split by inserting the downlink.
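
A rough Python sketch of the flag protocol described above, as a toy model and not as nbtree code. The flag bit value, the Page and ToyIndex classes, and the visit() method are all hypothetical; locking and WAL are left out, and a single method stands in for the critical section that inserts the downlink and clears the flag together.

```python
MISSING_DOWNLINK = 0x01  # hypothetical flag bit, value chosen for illustration


class Page:
    def __init__(self, blkno):
        self.blkno = blkno
        self.flags = 0


class ToyIndex:
    def __init__(self):
        self.pages = {}         # block number -> Page
        self.downlinks = set()  # block numbers the parent level points to

    def split(self, right_blkno, fail_before_downlink=False):
        """Create the right page of a split, flagged until its downlink exists."""
        right = Page(right_blkno)
        right.flags |= MISSING_DOWNLINK
        self.pages[right_blkno] = right
        if fail_before_downlink:
            # Stands in for any failure, e.g. out of disk space in the parent.
            raise OSError("simulated failure before the downlink was inserted")
        self.insert_downlink(right)
        return right

    def insert_downlink(self, page):
        """Stand-in for one critical section: add the downlink, clear the flag."""
        self.downlinks.add(page.blkno)
        page.flags &= ~MISSING_DOWNLINK

    def visit(self, blkno):
        """What a later inserter or vacuum would do when stepping on a page."""
        page = self.pages[blkno]
        if page.flags & MISSING_DOWNLINK:
            self.insert_downlink(page)  # finish the interrupted split
        return page


idx = ToyIndex()
idx.split(2)                      # normal case: the flag is never left set
try:
    idx.split(3, fail_before_downlink=True)
except OSError:
    pass                          # page 3 exists but has no downlink yet
idx.visit(3)                      # the next visitor completes the split
```

The point of the sketch is that the repair is driven by state stored on the page itself, so it does not matter whether the split was interrupted by a crash, a checkpoint race, or an ordinary error.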

Unfortunately that means holding the locks on the split pages longer than we do at the moment. Currently they are released as soon as the parent page is locked; with this change they would need to be held until the WAL record of the downlink insertion is done. B-tree is so heavily used that I'm a bit hesitant to sacrifice any concurrency there, but I don't think it would be noticeable in practice.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

