The B-tree page deletion algorithm has a race condition. We don't allow a page to be deleted if it's the rightmost child of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its parent, and then lock its left sibling, the page itself, its right sibling, and the parent, in that order. However, if the parent page is split after the check but before acquiring the locks, the target page might become the rightmost child, if the split happens at the right place. That leads to an error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw the above error if it's not. We could easily just give up rather than throw an error, but that approach doesn't scale to half-dead pages. To recap, although we don't normally allow deleting the rightmost child, if the page is the *only* child of its parent, we delete the child page and mark the parent page as half-dead in one atomic operation. But before we do that, we check that the parent can later be deleted, by checking that it in turn is not the rightmost child of the grandparent (potentially recursing all the way up to the root). But the same situation can arise there - the grandparent can be split while we're not holding the locks. We end up with a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been transferred to its right sibling. As the README points out, the keyspace at the grandparent level is "out-of-whack" until the half-dead page is deleted, and if enough tuples with keys in the transferred keyspace are inserted, the page might get split and a downlink might be inserted into the grandparent that is out-of-order. That might not cause any serious problem if it's transient (as the README ponders), but is surely bad if it stays that way.

Solutions
---------

1. The simplest solution I can see is to never delete internal pages, avoiding the whole concept of half-dead pages. That obviously leads to some bloat, though.

2. The second-simplest solution I see is to keep locked the whole chain of pages that will be deleted, and delete all of them as one atomic WAL-logged operation. Ie. the leaf page, and all the parent pages above it that will become half-dead, and the parent of the last half-dead page.

3. Another approach would be to get rid of the "can't delete rightmost child" limitation. We currently have that limitation because it ensures that we never need to change the high-key of a page. If we delete a page that is the rightmost child of its parent, we transfer the deleted keyspace from the parent page to its right sibling. To do that, we need to update the high key of the parent, as well as the downlink of the right sibling at the grandparent level. That's a bit complicated, because updating the high key might require splitting the page.

Still, I think it would be doable: When we delete a page that is the rightmost child of its parent, we mark the right sibling with an "out-of-whack" flag. It indicates that the keyspace of that page is not in sync in the parent level. Searches work fine, as there are no keys in the keyspace that is out-of-whack, but before allowing any insertions to the page, the situation needs to be fixed. That is the responsibility of the next insert to the page that comes along. To fix it, the parent's left sibling's high key needs to be updated, as well as the parent's downlink in the grandparent. We can do that in a few discrete steps; first update the high key, then update the downlink, and finally once the high key and parent's downlink are in sync with the reality at the bottom level, clear the "out-of-whack" flag.

On reflection, approach 2 seems best. It's fairly simple, although it potentially requires holding many pages locked simultaneously. In practice, B-trees are typically not very deep. We have a limit of 4 full-page images in a single WAL record, but since all the pages are empty, we can just WAL-log all the information needed to reconstruct the pages in deleted state without using full-page images. This also might be back-patchable, although it requires changing the WAL record format, so it would break replication from higher minor version to lower.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to