I finally got a chance to look at a reproducible case of this, thanks to Vlastimil Krejcir. It's now blindingly obvious that my patch of 2006-11-19, which was intended to fix bug #2737, has broken things.
The problem is partly a misordering of operations in _hash_expandtable(): it calls _hash_alloc_buckets() before it's actually committed to doing the table expansion. If it aborts the bucket split because it can't get the bucket lock, the relation EOF has already been pushed out, and thus the next time we come through and actually do a bucket split, the sanity check that compares actual to expected size blows up. As well it should. This part is not hard to fix: we can just re-order the operations.

However there is a remaining risk, which is that _hash_alloc_buckets() could fail internally after partially advancing the EOF. If it's trying to expand the index across a 1GB segment boundary, and fails (e.g., due to running out of disk space) after doing at least one successful smgrextend, then the index is simply broken: the filesystem EOF won't match what we expect, so subsequent bucket splits will PANIC at that same sanity check. Now a failure right at that point is pretty unlikely, but it could happen, and having the index broken to the point of PANIC certainly isn't an acceptable consequence. I can't see any way to positively guarantee there is no such failure, since we are changing filesystem state that's not roll-back-able.

I fear what we have to do here is abandon the idea that we can ensure the smgr/filesystem EOF for the index is exactly the last used page; rather, the invariant will need to be that the EOF is greater than or equal to the last used page. This is a bit annoying because it blows away the concurrency improvement I had hoped to make in _hash_getovflpage, as per this comment:

 * We have to fetch the page with P_NEW to ensure smgr's idea of the
 * relation length stays in sync with ours.  XXX It's annoying to do this
 * with metapage write lock held; would be better to use a lock that
 * doesn't block incoming searches.  Best way to fix it would be to stop
 * maintaining hashm_spares[hashm_ovflpoint] and rely entirely on the
 * smgr relation length to track where new overflow pages come from;
 * then we could release the metapage before we do the smgrextend.
 * FIXME later (not in beta...)

This approach will condemn us to relying on hashm_spares[hashm_ovflpoint] forever, because it's really what tracks the "last used page". I don't see any very good alternative, though. I thought for a bit about still relying on the EOF as the indication of how much space is used, but that would mean a failure partway through _hash_alloc_buckets() would result in permanent wastage of however much space it had managed to allocate before failing. Another problem is that I'm not sure this wouldn't break the overflow-page addressing scheme.

(The overflow-page addressing scheme is unreasonably complicated in any case ... I've occasionally thought about getting rid of the bitmaps in favor of a simple linked list of unused pages ... but having to change it makes any such fix even more invasive.)

Anyone have a better idea?

			regards, tom lane
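PS: for concreteness, here's a rough standalone sketch of what the relaxed invariant means for the split-time sanity check. The fake_smgr* stubs and claim_bucket_page() are invented stand-ins, not actual code; the point is just that only an EOF *short* of the expected page is treated as corruption, while extra pre-extended pages left behind by a failed _hash_alloc_buckets() get silently claimed by a later split.

/*
 * Standalone illustration (stubbed, not real PostgreSQL code) of the
 * "EOF >= last used page" invariant.  The fake smgr layer is one global
 * tracking the filesystem EOF in pages.
 */
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int BlockNumber;

static BlockNumber fake_eof = 4;

static BlockNumber
fake_smgrnblocks(void)
{
	return fake_eof;
}

static void
fake_smgrextend(BlockNumber blkno)
{
	if (blkno >= fake_eof)
		fake_eof = blkno + 1;
}

/*
 * Called when a bucket split wants 'new_blkno' for the new bucket.
 * Old rule: PANIC unless the EOF is exactly new_blkno.
 * New rule: only an EOF short of new_blkno is corruption; an EOF
 * beyond it just means the space was pre-extended, and we use it.
 */
static void
claim_bucket_page(BlockNumber new_blkno)
{
	BlockNumber	eof = fake_smgrnblocks();

	if (eof < new_blkno)
	{
		fprintf(stderr, "PANIC: index EOF %u short of expected page %u\n",
				eof, new_blkno);
		exit(1);
	}
	if (eof == new_blkno)
		fake_smgrextend(new_blkno);		/* normal case: grow by one page */
	/* eof > new_blkno: page already physically exists; just use it */
	printf("claimed page %u, EOF now %u\n", new_blkno, fake_eof);
}

int
main(void)
{
	fake_smgrextend(7);		/* alloc_buckets dies after pushing EOF to 8 */
	claim_bucket_page(4);	/* a later split wanting page 4 must not PANIC */
	claim_bucket_page(8);	/* split past the pre-extended area: extends */
	return 0;
}

Combined with the re-ordering fix in _hash_expandtable(), the worst a failure partway through _hash_alloc_buckets() can then do is leave some pre-extended pages on disk for a later split to absorb, rather than leaving the index in a state that PANICs.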