On Fri, Nov 1, 2013 at 9:49 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > The bigger picture here is that such an approach amounts to deciding that > no one will ever be allowed to fix hash indexes. I'm not for that, even > if I'm not volunteering to be the fixer myself.
Yeah. I have thought about doing some work on this, although I don't know when I'll ever have the time. As far as I can see, the basic problem is that we need a better way of splitting buckets. Right now, splitting a bucket means locking it, scanning the entire bucket chain and moving ~1/2 the tuples to the new bucket, and then unlocking it. Since this is a long operation, we have to use heavyweight locks to protect the buckets, which is bad for concurrency. Since it involves a modification to an arbitrarily large number of pages that has to be atomic, it's not possible to WAL-log it sensibly -- and in fact, I think this is really the major obstacle to being able to implementing WAL-logging for hash indexes. I think we could work around this problem as follows. Suppose we have buckets 1..N and the split will create bucket N+1. We need a flag that can be set and cleared on the metapage, call it split-in-progress, and a flag that can optionally be set on particular index tuples, call it moved-by-split. We will allow scans of all buckets and insertions into all buckets while the split is in progress, but (as now) we will not allow more than one split to be in progress at the same time. We start the split by updating the metapage to incrementing the number of buckets and set the split-in-progress flag. While the split-in-progress flag is set, any scans of bucket N+1 first scan that bucket, ignoring any tuples flagged moved-by-split, and then ALSO scan bucket (N+1)/2. The moved-by-split flag never has any effect except when scanning the highest-numbered bucket that existed at the start of that particular scan, and then only if the split-in-progress flag was also set at that time. Once the split operation has set the split-in-progress flag, it will begin scanning bucket (N+1)/2. Every time it finds a tuple that properly belongs in bucket N+1, it will insert the tuple into bucket N+1 with the moved-by-split flag set. Tuples inserted by anything other than a split operation will leave this flag clear, and tuples inserted while the split is in progress will target the same bucket that they would hit if the split were already complete. Thus, bucket N+1 will end up with a mix of moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples coming from parallel insertion activity. When the scan of bucket (N+1)/2 is complete, we know that bucket N+1 now contains all the tuples that are supposed to be there, so we clear the split-in-progress flag. Future scans of both buckets can proceed normally. Of course, there's a bit of a problem here: we haven't actually removed any tuples from bucket (N+1)/2. That's not a correctness problem; we'll check for an exact hash-value match before returning any tuples from this bucket from the scan, so tuples that aren't present in the bucket but should be will just ignored anyway. But it's clearly something that needs to be fixed before we can really declare victory. We need to wait lfor the completion of any scans that began before we finished populating bucket N+1, because otherwise we might remove tuples that they're still expecting to find in bucket (N+1)/2. I'm not sure if there's some clever trick we can use to accomplish this; e.g. if the scan will always maintain a pin, we could take a buffer cleanup lock on all the buffers in buckets N+1 and (N+1)/2 in the same sequence a scan would visit them. Of course, even if that works, it might be a while if somebody's left a cursor lying around. And even if scans can't be counted on to maintain a pin, a point about which I'm unsure off-hand, then things get even trickier. But maybe there's some solution. Anyway, once we out-wait concurrent scans, we can go back and nuke everything from (N+1)/2 that no longer maps onto that bucket. It might be possible to make this work opportunistically, like HOT cleanup: if we're scanning a buffer, and RecentGlobalXmin has progressed enough that we know there can't be any pre-split scans in progress (or if we can recognize in some other inexpensive way that old scans must be gone), and we see a tuple there that no longer maps onto that bucket, we scan the page and remove all such tuples. This is full of hand-waving, but I'd be curious to know whether the approach is perceived to have any merit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers