On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > You are right and I have changed the code as per your suggestion.
So... + /* + * We always maintain the pin on bucket page for whole scan operation, + * so releasing the additional pin we have acquired here. + */ + if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE) + _hash_dropbuf(rel, *bufp); This relies on the page contents to know whether we took a pin; that seems like a bad plan. We need to know intrinsically whether we took a pin. + * If the bucket split is in progress, then we need to skip tuples that + * are moved from old bucket. To ensure that vacuum doesn't clean any + * tuples from old or new buckets till this scan is in progress, maintain + * a pin on both of the buckets. Here, we have to be cautious about It wouldn't be a problem if VACUUM removed tuples from the new bucket, because they'd have to be dead anyway. It also wouldn't be a problem if it removed tuples from the old bucket that were actually dead. The real issue isn't vacuum anyway, but the process of cleaning up after a split. We need to hold the pin so that tuples being moved from the old bucket to the new bucket by the split don't get removed from the old bucket until our scan is done. + old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket); Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"? I feel like I'm repeating this ad nauseum, but I really think it's bad to rely on the special space instead of our own local variables! - /* we ran off the end of the bucket without finding a match */ + /* + * We ran off the end of the bucket without finding a match. + * Release the pin on bucket buffers. Normally, such pins are + * released at end of scan, however scrolling cursors can + * reacquire the bucket lock and pin in the same scan multiple + * times. + */ *bufP = so->hashso_curbuf = InvalidBuffer; ItemPointerSetInvalid(current); + _hash_dropscanbuf(rel, so); I think this comment is saying that we'll release the pin on the primary bucket page for now, and then reacquire it later if the user reverses the scan direction. But that doesn't sound very safe, because the bucket could be split in the meantime and the order in which tuples are returned could change. I think we want that to remain stable within a single query execution. + _hash_readnext(rel, &buf, &page, &opaque, + (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false); Same comment: don't rely on the special space to figure this out. Keep track. Also != 0 would be better than ? true : false. + /* + * setting hashso_skip_moved_tuples to false + * ensures that we don't check for tuples that are + * moved by split in old bucket and it also + * ensures that we won't retry to scan the old + * bucket once the scan for same is finished. + */ + so->hashso_skip_moved_tuples = false; I think you've got a big problem here. Suppose the user starts the scan in the new bucket and runs it forward until they end up in the old bucket. Then they turn around and run the scan backward. When they reach the beginning of the old bucket, they're going to stop, not move back to the new bucket, AFAICS. Oops. _hash_first() has a related problem: a backward scan starts at the end of the new bucket and moves backward, but it should start at the end of the old bucket, and then when it reaches the beginning, flip to the new bucket and move backward through that one. Otherwise, a backward scan and a forward scan don't return tuples in opposite order, which they should. I think what you need to do to fix both of these problems is a more thorough job gluing the two buckets together. I'd suggest that the responsibility for switching between the two buckets should probably be given to _hash_readprev() and _hash_readnext(), because every place that needs to advance to the next or previous page that cares about this. Right now you are trying to handle it mostly in the functions that call those functions, but that is prone to errors of omission. Also, I think that so->hashso_skip_moved_tuples is badly designed. There are two separate facts you need to know: (1) whether you are scanning a bucket that was still being populated at the start of your scan and (2) if yes, whether you are scanning the bucket being populated or whether you are instead scanning the corresponding "old" bucket. You're trying to keep track of that using one Boolean, but one Boolean only has two states and there are three possible states here. + if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf)) + { + + /* release the lock on bucket buffer, before completing the split. */ Extra blank line. +moved-by-split flag on a tuple indicates that tuple is moved from old to new +bucket. The concurrent scans can skip such tuples till the split operation is +finished. Once the tuple is marked as moved-by-split, it will remain so forever +but that does no harm. We have intentionally not cleared it as that can generate +an additional I/O which is not necessary. The first sentence needs to start with "the" but the second sentence shouldn't. It would be good to adjust this part a bit to more clearly explain that the split-in-progress and split-cleanup flags are bucket-level flags, while moved-by-split is a per-tuple flag. It's possible to figure this out from what you've written, but I think it could be more clear. Another thing that is strange is that the code uses THREE flags, bucket-being-split, bucket-being-populated, and needs-split-cleanup, but the README conflates the first two and uses a different name. +previously-acquired content lock, but not pin and repeat the process using the s/but not pin/but not the pin,/ A problem is that if a split fails partway through (eg due to insufficient -disk space) the index is left corrupt. The probability of that could be -made quite low if we grab a free page or two before we update the meta -page, but the only real solution is to treat a split as a WAL-loggable, +disk space or crash) the index is left corrupt. The probability of that +could be made quite low if we grab a free page or two before we update the +meta page, but the only real solution is to treat a split as a WAL-loggable, must-complete action. I'm not planning to teach hash about WAL in this -go-round. +go-round. However, we do try to finish the incomplete splits during insert +and split. I think this paragraph needs a much heavier rewrite explaining the new incomplete split handling. It's basically wrong now. Perhaps replace it with something like this: -- If a split fails partway through (e.g. due to insufficient disk space or an interrupt), the index will not be corrupted. Instead, we'll retry the split every time a tuple is inserted into the old bucket prior to inserting the new tuple; eventually, we should succeed. The fact that a split is left unfinished doesn't prevent subsequent buckets from being split, but we won't try to split the bucket again until the prior split is finished. In other words, a bucket can be in the middle of being split for some time, but ti can't be in the middle of two splits at the same time. Although we can survive a failure to split a bucket, a crash is likely to corrupt the index, since hash indexes are not yet WAL-logged. -- + Acquire cleanup lock on target bucket + Scan and remove tuples + For overflow page, first we need to lock the next page and then + release the lock on current bucket or overflow page + Ensure to have buffer content lock in exclusive mode on bucket page + If buffer pincount is one, then compact free space as needed + Release lock I don't think this summary is particularly correct. You would never guess from this that we lock each bucket page in turn and then go back and try to relock the primary bucket page at the end. It's more like: acquire cleanup lock on primary bucket page loop: scan and remove tuples if this is the last bucket page, break out of loop pin and x-lock next page release prior lock and pin (except keep pin on primary bucket page) if the page we have locked is not the primary bucket page: release lock and take exclusive lock on primary bucket page if there are no other pins on the primary bucket page: squeeze the bucket to remove free space Come to think of it, I'm a little worried about the locking in _hash_squeezebucket(). It seems like we drop the lock on each "write" bucket page before taking the lock on the next one. So a concurrent scan could get ahead of the cleanup process. That would be bad, wouldn't it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers