On 09.11.2013 18:24, Tom Lane wrote:
Heikki Linnakangas <hlinnakan...@vmware.com> writes:
2. The second-simplest solution I see is to keep locked the whole chain
of pages that will be deleted, and delete all of them as one atomic
WAL-logged operation. Ie. the leaf page, and all the parent pages above
it that will become half-dead, and the parent of the last half-dead page.

This would be more tenable if we could put a known limit on the number of
pages to be changed at once.  I'm not too awake at the moment, so maybe
this is not possible, but could we simply decide in advance that we won't
propagate page deletion up more than a-small-number of levels?  It'd
require allowing a page to remain half-dead until some other vacuum comes
along and fixes it, but that seems OK.

I don't feel comfortable leaving pages in half-dead state. Looking back at the archives, your original design ten years ago did exactly that (http://www.postgresql.org/message-id/8281.1045089...@sss.pgh.pa.us), but that turned out to be a bad idea (http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=70ce5c908202ada7616f7afded8a91bbf2742471). Mind you, even if we check that the half-dead page at some upper level is eventually deletable because it's not the rightmost child, it might become the rightmost child if we don't hold the lock so that the next vacuum cannot delete it, and we're back to square one.

We could just punt if more than X pages would need to be changed. That would mean that we never delete pages at the top (h - X) levels of the tree. In practice that should be fine if X is high enough. As a data point, GIN list page deletion holds 16 pages locked at once (GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge. As another data point, in the worst case the keys are so wide that only 2 keys fit on each B-tree page. With that assumption, the tree can be at most 32 levels deep if you just insert into it with no deletions, since MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not sure). Deletions make it more complicated, but I would be pretty surprised if you can construct a B-tree tallers than, say 40 levels.

Things gets tricky if shared_buffers is very small; with shared_buffers=16, you certainly can't hold more than 16 buffers pinned at once. (in fact, I think ginfast.c already has a problem with that...)

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to