Re: [HACKERS] B-tree parent pointer and checkpoints

Heikki Linnakangas Fri, 12 Nov 2010 11:21:27 -0800

On 11.11.2010 20:34, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakan...@enterprisedb.com>  writes:

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.


Oh, that's a really good idea, I think.  But what about page splits?
I guess in the case of a split, you'd be replacing the parent entry
anyway, so having previously updated it to something larger doesn't
really cause a problem other than wasting a few cycles --- which are
probably still less than you save by not having to traverse back up.

I started looking at this, and run into a problem with page splits. Theright-links in GiST work differently from b-tree, a right-link is onlyfollowed if we detect a concurrent page split. A concurrent split isdetected by comparing the "NSN" field on the child page against the LSNthat we saw on the parent when walking down. It means that if you justleave the incompletely split page in the tree, where one half is missingthe parent pointer, scans will not find any tuples on that page. Theywould at first, but as soon as the the parent page is updated due tosome unrelated insertion, the LSN of the parent is bumped above the NSNstored on the child, and the page becomes invisible to scanners.

We avoid that problem during normal operation by keeping the parent pagelocked while the child is split, until the downlink is inserted into theparent. That blocks any other modifications to the parent page thatwould bump the LSN, until our downlink has been inserted. That doesn'twork after crash recovery, as all the locks are released.

I think we can work around that with a small modification to the pagesplit algorithm. In a nutshell, when the child page is split, put a flagon the left half indicating that the rightlink must always be followed,regardless of the NSN. When the downlink is inserted to the parent,clear the flag. Setting and clearing of these flags need to be performedduring WAL replay as well.


So to split a page:

(0. Lock the page to be split)

1. Split the page. Mark the rightlink in the left half with a flagindicating that it always needs to be followed.

2. Lock the parent.
3. Insert downlink. (The parent may need to be split too)

4. Clear the flag in the child, and update NSN to the LSN of thedownlink insertion record.

5. Release child.

6. If the parent was split in step 3, goto 2.

If we crash between steps 1 and 3, the rightlink will have the flag, soscans will know to always follow it. If we crash after step 3, recoverywill replay steps 3 and 4, so scans will see the downlinks as usual.

After a crash, the tree can be fixed up later by vacuum or subsequentinserts, by performing steps 2-4.



--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-tree parent pointer and checkpoints

Reply via email to