Hi,

On 2019-04-04 21:04:49 -0700, Andres Freund wrote:
> On 2019-04-04 23:57:58 -0400, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > > I think the right approach would be to do all of this in heap_insert and
> > > heap_multi_insert. Whenever starting to work on a page, if INSERT_FROZEN
> > > is specified, remember whether it is either currently empty, or is
> > > already marked as all-visible. If previously empty, mark it as all
> > > visible at the end. If already all visible, there's no need to change
> > > that. If not yet all-visible, no need to do anything, since it can't
> > > have been inserted with COPY FREEZE. Do you see any problem doing it
> > > that way?
> >
> > Do we want to add overhead to these hot-spot routines for this purpose?
>
> For heap_multi_insert I can't see it being a problem - it's only used
> from copy.c, and the cost should be "smeared" over many tuples. I'd
> assume that compared with locking a page, WAL logging, etc, it'd not
> even meaningfully show up for heap_insert. Especially because we already
> have codepaths for options & HEAP_INSERT_FROZEN in
> heap_prepare_insert(), and I'd assume those could be combined.
>
> I think we should measure it, but I don't think that one or two
> additional, very well predicted, branches are going to be measurable
> in those routines.
>
> The patch, as implemented, has modifications in
> RelationGetBufferForTuple(), that seem like they'd be more expensive.
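In rough pseudocode - the variable names here are just for illustration, not
what the patch below ends up using - the heap_multi_insert() side of that idea
looks roughly like:

    /* sketch only: evaluated once per target page, before modifying it */
    page_was_empty = PageGetMaxOffsetNumber(page) == 0;

    if ((options & HEAP_INSERT_FROZEN) && page_was_empty)
    {
        /* mark page + VM bits all-visible/all-frozen once the tuples are in */
        set_all_visible_at_end = true;
    }
    else if (PageIsAllVisible(page) && !(options & HEAP_INSERT_FROZEN))
    {
        /* pre-existing behaviour: inserting non-frozen rows clears all-visible */
        clear_all_visible = true;
    }
    /*
     * Otherwise nothing to do: a non-empty page that isn't all-visible can't
     * have been filled by this COPY FREEZE.
     */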
Here's a *prototype* patch for this. It only implements what I described
for heap_multi_insert, not for plain inserts. I wanted to see what others
think before investing additional time.

I don't think it will be possible to see the overhead of the changed code
in heap_multi_insert(), and - with somewhat less confidence - I expect
it's also going to be OK for heap_insert(). But we gotta measure that.

This avoids an extra WAL record for setting empty pages to all visible: it
adds XLH_INSERT_ALL_VISIBLE_SET & XLH_INSERT_ALL_FROZEN_SET and sets them
when appropriate in heap_multi_insert.

Unfortunately, visibilitymap_set() currently doesn't really allow doing
this properly, as it has the heap WAL logging embedded in it. I think we
should remove the WAL logging from visibilitymap_set() and move it to a
separate, heap-specific function. Right now other table AMs would e.g.
have to reimplement visibilitymap_set(), so that's a second reason to
separate that functionality. Let me try to come up with a proposal - a
rough sketch follows after the page listing below.

The patch currently does a visibilitymap_pin() while holding an exclusive
lwlock on the page. That's something we normally try to avoid - but I
think it's probably OK here, because INSERT_FROZEN can only be used to
insert into a new relfilenode (which thus no other session can see). I
think it's preferable to have this logic specific to the INSERT_FROZEN
path, rather than adding nontrivial complications to
RelationGetBufferForTuple().

I noticed that, before this patch, we do a

    if (vmbuffer != InvalidBuffer)
        ReleaseBuffer(vmbuffer);

after every filled page. That doesn't strike me as particularly smart -
it's pretty likely that the next heap page to be filled is going to be on
the same vm page as the previous iteration's.

I noticed one small oddity that I think is common to all the approaches
presented in this thread so far. After

    BEGIN;
    TRUNCATE foo;
    COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
    COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
    COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
    COMMIT;

we currently end up with pages like:

┌───────┬───────────┬──────────┬───────┬───────┬───────┬─────────┬──────────┬─────────┬───────────┐
│ blkno │    lsn    │ checksum │ flags │ lower │ upper │ special │ pagesize │ version │ prune_xid │
├───────┼───────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────┤
│     0 │ 0/50B5488 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     1 │ 0/50B6360 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     2 │ 0/50B71B8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     3 │ 0/50B8028 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     4 │ 0/50B8660 │        0 │     4 │   408 │  5120 │    8192 │     8192 │       4 │         0 │
│     5 │ 0/50B94B8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     6 │ 0/50BA328 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     7 │ 0/50BB180 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     8 │ 0/50BBFD8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     9 │ 0/50BCF88 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    10 │ 0/50BDDE0 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    11 │ 0/50BEC50 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    12 │ 0/50BFAA8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    13 │ 0/50C06F8 │        0 │     4 │   792 │  2048 │    8192 │     8192 │       4 │         0 │
└───────┴───────────┴──────────┴───────┴───────┴───────┴─────────┴──────────┴─────────┴───────────┘
(14 rows)

Note how block 4 has more free space available.
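As a very rough sketch of that visibilitymap_set() split - the function names
below are just placeholders, not existing APIs - I'm thinking of something
along these lines:

    /*
     * Placeholder sketch: set VM bits without doing any WAL logging. Usable
     * by any table AM, and by callers that cover the VM change in their own
     * WAL record (as heap_multi_insert() does above for HEAP_INSERT_FROZEN).
     */
    extern void visibilitymap_set_bits(Relation rel, BlockNumber heapBlk,
                                       Buffer vmBuf, uint8 flags);

    /*
     * Placeholder sketch: heap-specific wrapper that emits the
     * XLOG_HEAP2_VISIBLE record (what visibilitymap_set() currently does
     * internally) and then sets the bits via the function above.
     */
    extern XLogRecPtr heap_visibilitymap_set(Relation rel, BlockNumber heapBlk,
                                             Buffer heapBuf, Buffer vmBuf,
                                             TransactionId cutoff_xid,
                                             uint8 flags);

The existing visibilitymap_set() could then presumably be reimplemented on top
of those two, or its callers adjusted directly.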
Block 4 has more free space because the visibilitymap_pin() called in the
first COPY has to vm_extend(), which in turn does:

    /*
     * Send a shared-inval message to force other backends to close any smgr
     * references they may have for this rel, which we are about to change.
     * This is a useful optimization because it means that backends don't have
     * to keep checking for creation or extension of the file, which happens
     * infrequently.
     */
    CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);

which invalidates ->rd_smgr->smgr_targblock *after* the first COPY, because
that's when the pending smgr invalidations are sent out. That's far from
great, but it doesn't seem to be this patch's fault.

It seems to me we need a separate invalidation that doesn't close the whole
smgr relation, but just invalidates the VM-specific fields.

Greetings,

Andres Freund
>From 03a50acf74d9f9e7876b44b75ce94b106f6e4147 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sun, 7 Apr 2019 18:00:10 -0700
Subject: [PATCH 2/2] WIP-copy-freeze-should-actually-freeze-right

---
 src/backend/access/heap/heapam.c        | 124 +++++++++++++++++++++---
 src/backend/access/heap/visibilitymap.c |   2 +-
 src/include/access/heapam_xlog.h        |   2 +
 3 files changed, 116 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a05b6a07ad0..d4040911956 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2115,6 +2115,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     int         ndone;
     PGAlignedBlock scratch;
     Page        page;
+    Buffer      vmbuffer = InvalidBuffer;
     bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
@@ -2169,9 +2170,11 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     while (ndone < ntuples)
     {
         Buffer      buffer;
-        Buffer      vmbuffer = InvalidBuffer;
+        bool        starting_with_empty_page;
         bool        all_visible_cleared = false;
+        bool        all_frozen_set = false;
         int         nthispage;
+        XLogRecPtr  recptr;

         CHECK_FOR_INTERRUPTS();

@@ -2184,6 +2187,11 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);

+        starting_with_empty_page = PageGetMaxOffsetNumber(page) == 0;
+
+        if (starting_with_empty_page && (options & HEAP_INSERT_FROZEN))
+            all_frozen_set = true;
+
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();

@@ -2209,7 +2217,14 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
                 log_heap_new_cid(relation, heaptup);
         }

-        if (PageIsAllVisible(page))
+        /*
+         * If the page is all visible, need to clear that, unless we're only
+         * going to add further frozen rows to it.
+         *
+         * If we're only adding already frozen rows, and the page was
+         * previously empty, mark it as all-visible.
+         */
+        if (PageIsAllVisible(page) && !(options & HEAP_INSERT_FROZEN))
         {
             all_visible_cleared = true;
             PageClearAllVisible(page);
@@ -2217,6 +2232,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
                                 BufferGetBlockNumber(buffer),
                                 vmbuffer, VISIBILITYMAP_VALID_BITS);
         }
+        else if (all_frozen_set)
+            PageSetAllVisible(page);

         /*
          * XXX Should we set PageSetPrunable on this page ? See heap_insert()
@@ -2227,7 +2244,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
         /* XLOG stuff */
         if (needwal)
         {
-            XLogRecPtr  recptr;
             xl_heap_multi_insert *xlrec;
             uint8       info = XLOG_HEAP2_MULTI_INSERT;
             char       *tupledata;
@@ -2240,8 +2256,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
              * If the page was previously empty, we can reinit the page
              * instead of restoring the whole thing.
              */
-            init = (ItemPointerGetOffsetNumber(&(heaptuples[ndone]->t_self)) == FirstOffsetNumber &&
-                    PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1);
+            init = starting_with_empty_page;

             /* allocate xl_heap_multi_insert struct from the scratch area */
             xlrec = (xl_heap_multi_insert *) scratchptr;
@@ -2259,7 +2274,17 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
             /* the rest of the scratch space is used for tuple data */
             tupledata = scratchptr;

-            xlrec->flags = all_visible_cleared ? XLH_INSERT_ALL_VISIBLE_CLEARED : 0;
+            xlrec->flags = 0;
+            Assert((all_visible_cleared == 0 && all_frozen_set == 0) ||
+                   all_visible_cleared != all_frozen_set);
+            if (all_visible_cleared)
+                xlrec->flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+            if (all_frozen_set)
+            {
+                xlrec->flags |= XLH_INSERT_ALL_VISIBLE_SET;
+                xlrec->flags |= XLH_INSERT_ALL_FROZEN_SET;
+            }
+
             xlrec->ntuples = nthispage;

             /*
@@ -2333,13 +2358,46 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,

         END_CRIT_SECTION();

-        UnlockReleaseBuffer(buffer);
-        if (vmbuffer != InvalidBuffer)
-            ReleaseBuffer(vmbuffer);
+        if (all_frozen_set)
+        {
+            Assert(PageIsAllVisible(page));
+            /*
+             * Having to potentially read the page while holding an exclusive
+             * lock on the page isn't great. But we only get here if
+             * HEAP_INSERT_FROZEN is set, and we only do so if the table isn't
+             * readable outside of this session. Therefore doing IO here isn't
+             * that bad.
+             */
+            visibilitymap_pin(relation, BufferGetBlockNumber(buffer), &vmbuffer);
+
+            /*
+             * FIXME: setting recptr here is a dirty dirty hack, to prevent
+             * visibilitymap_set() from WAL logging.
+             *
+             * It's fine to use InvalidTransactionId here - this is only used
+             * when HEAP_INSERT_FROZEN is specified, which intentionally
+             * violates visibility rules.
+             */
+            visibilitymap_set(relation, BufferGetBlockNumber(buffer), buffer,
+                              recptr, vmbuffer,
+                              InvalidTransactionId,
+                              VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+        }
+
+        UnlockReleaseBuffer(buffer);

         ndone += nthispage;
+
+        /*
+         * NB: Only release vmbuffer after inserting all tuples - it's fairly
+         * likely that we'll insert into subsequent heap pages that are likely
+         * to use the same vm page.
+         */
     }

+    if (vmbuffer != InvalidBuffer)
+        ReleaseBuffer(vmbuffer);
+
     /*
      * We're done with the actual inserts.  Check for conflicts again, to
      * ensure that all rw-conflicts in to these inserts are detected.  Without
@@ -8176,6 +8234,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
     BlockNumber blkno;
     Buffer      buffer;
     Page        page;
+    Page        vmpage;
     union
     {
         HeapTupleHeaderData hdr;
@@ -8200,13 +8259,54 @@ heap_xlog_multi_insert(XLogReaderState *record)
      * The visibility map may need to be fixed even if the heap page is
      * already up-to-date.
      */
-    if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
+    if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED |
+                        XLH_INSERT_ALL_VISIBLE_SET |
+                        XLH_INSERT_ALL_FROZEN_SET))
     {
         Relation    reln = CreateFakeRelcacheEntry(rnode);
         Buffer      vmbuffer = InvalidBuffer;

         visibilitymap_pin(reln, blkno, &vmbuffer);
-        visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
+
+        if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_CLEARED))
+        {
+            Assert(!(xlrec->flags & (XLH_INSERT_ALL_FROZEN_SET |
+                                     XLH_INSERT_ALL_VISIBLE_SET)));
+            visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
+        }
+        else
+        {
+            int         vmbits = 0;
+
+            if (xlrec->flags & (XLH_INSERT_ALL_VISIBLE_SET))
+                vmbits |= VISIBILITYMAP_ALL_VISIBLE;
+            if (xlrec->flags & (XLH_INSERT_ALL_FROZEN_SET))
+                vmbits |= VISIBILITYMAP_ALL_FROZEN;
+
+            vmpage = BufferGetPage(vmbuffer);
+
+            /*
+             * Don't set the bit if replay has already passed this point.
+             *
+             * It might be safe to do this unconditionally; if replay has
+             * passed this point, we'll replay at least as far this time as
+             * we did before, and if this bit needs to be cleared, the record
+             * responsible for doing so should be again replayed, and clear
+             * it. For right now, out of an abundance of conservatism, we use
+             * the same test here we did for the heap page.  If this results
+             * in a dropped bit, no real harm is done; and the next VACUUM
+             * will fix it.
+             *
+             * XXX: This seems entirely unnecessary?
+             *
+             * FIXME: Theoretically we should only do this after we've
+             * modified the heap - but it's safe to do it here I think,
+             * because this means that the page previously was empty.
+             */
+            if (lsn > PageGetLSN(vmpage))
+                visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
+                                  InvalidTransactionId, vmbits);
+        }
+
         ReleaseBuffer(vmbuffer);
         FreeFakeRelcacheEntry(reln);
     }
@@ -8284,6 +8384,8 @@ heap_xlog_multi_insert(XLogReaderState *record)

         if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
             PageClearAllVisible(page);
+        if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_SET)
+            PageSetAllVisible(page);

         MarkBufferDirty(buffer);
     }
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b261..8a9f4f4c42e 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -253,7 +253,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
     elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
 #endif

-    Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
+    //Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
     Assert(InRecovery || BufferIsValid(heapBuf));
     Assert(flags & VISIBILITYMAP_VALID_BITS);

diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 22cd13c47fc..557cc7dadcc 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -67,6 +67,8 @@
 #define XLH_INSERT_LAST_IN_MULTI                (1<<1)
 #define XLH_INSERT_IS_SPECULATIVE               (1<<2)
 #define XLH_INSERT_CONTAINS_NEW_TUPLE           (1<<3)
+#define XLH_INSERT_ALL_VISIBLE_SET              (1<<4)
+#define XLH_INSERT_ALL_FROZEN_SET               (1<<5)

 /*
  * xl_heap_update flag values, 8 bits are available.
-- 
2.21.0.dirty