On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> Great, thanks. 0001 looks good to me now, so committed.
>>
>> Committed 0002.
>
> Here are some initial review thoughts on 0003 based on a first read-through.
More thoughts on the main patch:

The text you've added to the hash index README throws around the term
"single WAL entry" pretty freely. For example: "An insertion that causes
an addition of an overflow page is logged as a single WAL entry preceded
by a WAL entry for new overflow page required to insert a tuple." When
you say a single WAL entry, you make it sound like there's only one, but
because it's preceded by another WAL entry, there are actually two. I
would rephrase this to just avoid using the word "single" this way. For
example: "If an insertion causes the addition of an overflow page, there
will be one WAL entry for the new overflow page and a second entry for
the insert itself."

+mode anyway). It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space. That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space. In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I
think.

+Squeeze operation moves tuples from one of the buckets later in the chain to

A squeeze operation

+As Squeeze operation involves writing multiple atomic operations, it is

As a squeeze operation involves multiple atomic operations

+   if (!xlrec.is_primary_bucket_page)
+       XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI will
be taken if this is the first reference to this page since the checkpoint,
but hash_xlog_delete won't restore the FPI. I think that exposes us to a
torn-page hazard. _hash_freeovflpage/hash_xlog_squeeze_page seems to have
the same disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
Not sure exactly what the fix should be, but see the sketch further down
for the usual shape of a redo routine that restores the image.

+   /*
+    * As bitmap page doesn't have standard page layout, so this will
+    * allow us to log the data.
+    */
+   XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+   XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));

If the page doesn't have a standard page layout, why are we passing
REGBUF_STANDARD? But I think you've hacked things so that bitmap pages DO
have a standard page layout, per the change to _hash_initbitmapbuffer, in
which case the comment seems wrong. (The comment is also a little
confusing grammatically, but let's iron out the substantive issue first.)

+   recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+   PageSetLSN(BufferGetPage(ovflbuf), recptr);
+   PageSetLSN(BufferGetPage(buf), recptr);
+
+   if (BufferIsValid(mapbuf))
+       PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+   if (BufferIsValid(newmapbuf))
+       PageSetLSN(BufferGetPage(newmapbuf), recptr);

I think you forgot to call PageSetLSN() on metabuf. (There are 5 block
references in the write-ahead log record but only 4 calls to PageSetLSN ...
surely that can't be right!)

+   /*
+    * We need to release the locks once the prev pointer of overflow bucket
+    * and next of left bucket are set, otherwise concurrent read might skip
+    * the bucket.
+    */
+   if (BufferIsValid(leftbuf))
+       UnlockReleaseBuffer(leftbuf);
+   UnlockReleaseBuffer(ovflbuf);

I don't believe the comment. Postponing the lock release wouldn't cause the
concurrent read to skip the bucket. It might cause it to be delayed
unnecessarily, but the following comment already addresses that point. I
think you can just nuke this comment.
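To make the torn-page point above a bit more concrete, the usual way a redo
routine avoids this is to go through XLogReadBufferForRedo() for every block
it registered: when the record carries a full-page image, that call restores
it and returns BLK_RESTORED, and the caller only replays the change when it
returns BLK_NEEDS_REDO. What follows is only a rough sketch of that shape -
the function name, the block number, and the replay body are placeholders,
not the patch's actual code:

/* Hedged sketch only: generic redo shape that honors full-page images. */
#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static void
hash_xlog_delete_sketch(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    Buffer      buffer;

    /*
     * Assume the page the deletions were applied to was registered as
     * block 1; the real patch may number its blocks differently.  If the
     * record carries an FPI for this block, XLogReadBufferForRedo()
     * restores it and returns BLK_RESTORED, so a torn page is repaired
     * without any work here.
     */
    if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
    {
        Page        page = BufferGetPage(buffer);

        /* ... re-apply the tuple deletions carried in the WAL data ... */

        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);
}

The same shape would presumably work for the squeeze and move-page-contents
replay functions mentioned above.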
+   xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+   xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;

Maybe just

    xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf);

and similar?

+   /*
+    * We release locks on writebuf and bucketbuf at end of replay operation
+    * to ensure that we hold lock on primary bucket page till end of
+    * operation. We can optimize by releasing the lock on write buffer as
+    * soon as the operation for same is complete, if it is not same as
+    * primary bucket page, but that doesn't seem to be worth complicating the
+    * code.
+    */
+   if (BufferIsValid(writebuf))
+       UnlockReleaseBuffer(writebuf);
+
+   if (BufferIsValid(bucketbuf))
+       UnlockReleaseBuffer(bucketbuf);

Here in hash_xlog_squeeze_page(), you wait until the very end to release
these locks, but in e.g. hash_xlog_addovflpage you release them considerably
earlier for reasons that seem like they would also be valid here. I think
you could move this back to before the updates to the metapage and bitmap
page.

+   /*
+    * Note: in normal operation, we'd update the metapage while still
+    * holding lock on the page we inserted into. But during replay it's
+    * not necessary to hold that lock, since no other index updates can
+    * be happening concurrently.
+    */

This comment in hash_xlog_init_bitmap_page seems to be off-base, because we
didn't just insert into a page. I'm not disputing that it's safe, though:
not only can nobody else be modifying it because we're in recovery, but also
nobody else can see it yet; the creating transaction hasn't yet committed.

+   /*
+    * Note that we still update the page even if it was restored from a full
+    * page image, because the bucket flag is not included in the image.
+    */

Really? Why not? (Because it's part of the special space?)

Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to guard
against concurrent scans? And also hash_xlog_split_complete? And
hash_xlog_delete? If the regular code path is getting a cleanup lock to
protect against concurrent scans, recovery better do it, too.

+   /*
+    * There is no harm in releasing the lock on old bucket before new bucket
+    * during replay as no other bucket splits can be happening concurrently.
+    */
+   if (BufferIsValid(oldbuf))
+       UnlockReleaseBuffer(oldbuf);

And I'm not sure this is safe either. The issue isn't that there might be
another bucket split happening concurrently, but that there might be scans
that get confused by seeing the bucket split flag set before the new bucket
is created or the metapage updated. Seems like it would be safer to keep
all the locks for this one.

+    * Initialise the new bucket page, here we can't complete zeroed the page

I think maybe a good way to write these would be "we can't simply zero the
page". And Initialise -> Initialize.

    /*
+    * Change the shared buffer state in critical section,
+    * otherwise any error could make it unrecoverable after
+    * recovery.
+    */
+   START_CRIT_SECTION();
+
+   /*
     * Insert tuple on new page, using _hash_pgaddtup to ensure
     * correct ordering by hashkey. This is a tad inefficient
     * since we may have to shuffle itempointers repeatedly.
     * Possible future improvement: accumulate all the items for
     * the new page and qsort them before insertion.
     */
    (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
+   END_CRIT_SECTION();

No way. You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.
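Concretely, the conventional ordering looks roughly like the sketch below.
This is just hand-written illustration - the record type and the data being
registered are placeholders rather than what the patch actually emits; the
point is only that the page modification, MarkBufferDirty(), XLogInsert(),
and PageSetLSN() all sit inside a single critical section:

/* Sketch of the conventional order for a WAL-logged page change. */
START_CRIT_SECTION();

/* First apply the change to the shared buffer and mark it dirty. */
(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
MarkBufferDirty(nbuf);

/* Then log it, still inside the critical section. */
if (RelationNeedsWAL(rel))
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    XLogRegisterBuffer(0, nbuf, REGBUF_STANDARD);
    XLogRegisterBufData(0, (char *) new_itup, itemsz);
    recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

    /* Stamp the modified page with the record's LSN before unlocking. */
    PageSetLSN(BufferGetPage(nbuf), recptr);
}

END_CRIT_SECTION();

In other words, START_CRIT_SECTION() has to come before the first change to
nbuf, and END_CRIT_SECTION() only after the WAL record has been inserted
and the page LSNs set.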
hash_redo's mapping of record types to function names is a little less
regular than seems desirable. Generally, the function name is the
lower-case version of the record type with "hash" and "xlog" switched. But
XLOG_HASH_ADD_OVFL_PAGE and XLOG_HASH_SPLIT_ALLOCATE_PAGE break the
pattern.

It seems like a good test to do with this patch would be to set up a
pgbench test on the master with a hash index replacing the usual btree
index. Then, set up a standby and run a read-only pgbench on the standby
while a read-write pgbench test runs on the master. Maybe you've already
tried something like that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company