On 11.11.2010 17:16, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakan...@enterprisedb.com>  writes:
GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

I think it'd be okay as far as that one entry is concerned, since as you
say it doesn't matter whether a search finds it.  (We'd have to be sure
that VACUUM would still find it to remove it, of course, but that
doesn't use a normal search.)  You're right that it poses a hazard of
subsequent inserts deciding that they don't need to do work on upper
levels because the lower ones look OK already.  But depending on the
details of the search algorithm, this might be a non-problem: if you
remember that the upper level entries didn't cover your key when you
descended, you'd still know you need to recompute them.

Hmm, we don't currently keep track of that when we descend the tree to choose the target page, but perhaps an extra Consistent call to check that would be acceptable. We already call Penalty for every tuple on each internal node on the way, so compared to that one more call should not be too expensive. If we do that, I think it would simplify the algorithm quite a bit to just update all the parents on the way down, instead of traversing up from the bottom after inserting the tuple to the leaf.

Something else I just noticed is that WAL replay isn't capable of
completely fixing the index anyway:

  * To complete insert we can't use basic insertion algorithm because
  * during insertion we can't call user-defined support functions of opclass.
  * So, we insert 'invalid' tuples without real key and do it by separate 
algorithm.
  * 'invalid' tuple should be updated by vacuum full.

Given that there's no more vacuum full, and nobody has been expected to
run it routinely for a long time anyway, this fixup approach seems
pretty completely broken anyhow.

The 'invalid' tuples don't affect correctness, but are a drag on performance, so they are similar to incomplete b-tree splits. I suspect the overhead of an invalid gist pointer is much bigger than the overhead of an incomplete b-tree split, though. I agree we should get rid of that, it's not comforting to get a stream of messages in the logs saying you should run VACUUM FULL.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to