I finally got a chance to look at a reproducible case of this, thanks
to Vlastimil Krejcir.  It's now blindingly obvious that my patch of
2006-11-19, which was intended to fix bug #2737, has broken things.

The problem is partly a misordering of operations in
_hash_expandtable(): it calls _hash_alloc_buckets() before it's actually
committed to doing the table expansion.  If it aborts the bucket split
because it can't get the bucket lock, the relation EOF has already been
pushed out, and thus the next time we come through and actually do a
bucket split, the sanity check that compares actual to expected size
blows up.  As well it should.

This part is not hard to fix: we can just re-order the operations.
However there is a remaining risk, which is that _hash_alloc_buckets()
could fail internally after partially advancing the EOF. If it's trying
to expand the index across a 1Gb segment boundary, and fails (eg, due to
out-of-disk-space) after doing at least one successful smgrextend, then
the index is simply broken, because the filesystem EOF won't match what
we expect, causing subsequent bucket splits to panic here.  Now a
failure right at that point is pretty unlikely, but it could happen,
and having the index broken to the point of PANIC certainly isn't a very
acceptable consequence.

I can't see any way that we could positively guarantee there is no such
failure, since we are changing filesystem state that's not roll-back-able.
I fear what we have to do here is abandon the idea that we can ensure
that the smgr/filesystem EOF for the index is exactly the last used
page; rather, the invariant will need to be that the EOF is greater than
or equal to the last used page.  This is a bit annoying because it blows
away the concurrency improvement I had hoped to make in
_hash_getovflpage, as per this comment:

     * We have to fetch the page with P_NEW to ensure smgr's idea of the
     * relation length stays in sync with ours.  XXX It's annoying to do this
     * with metapage write lock held; would be better to use a lock that
     * doesn't block incoming searches.  Best way to fix it would be to stop
     * maintaining hashm_spares[hashm_ovflpoint] and rely entirely on the
     * smgr relation length to track where new overflow pages come from;
     * then we could release the metapage before we do the smgrextend.
     * FIXME later (not in beta...)

This approach will condemn us to relying on
hashm_spares[hashm_ovflpoint] forever, because it's really what tracks
the "last used page".  I don't see any very good alternative, though.

I thought for a bit about still relying on the EOF as the indication of
how much space is used, but that would mean that a failure partway
through _hash_alloc_buckets() would result in permanent wastage of
however much space it had managed to allocate before failing.  Another
problem is that I'm not sure this wouldn't break the overflow page
addressing scheme.  (The overflow page addressing scheme is unreasonably
complicated in any case ... I've occasionally thought about getting rid
of the bitmaps in favor of a simple linked list of unused pages ... but
having to change it makes any such fix even more invasive.)

Anyone have a better idea?

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate

Reply via email to