On Tue, Mar 3, 2015 at 3:13 AM, Andres Freund <and...@2ndquadrant.com> wrote:
>> toast_save_datum() is called with a heap_insert() call before heap
>> insertion for the tuple proper. We're relying on the assumption that
>> if there is no immediate super deletion record, things are fine. We
>> cannot speculatively insert into catalogs, BTW.
>
> I fail to see what prohibits e.g. another catalog modifying transaction
> to commit inbetween and distribute a new snapshot.
SnapBuildDistributeNewCatalogSnapshot()/REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT
does look like a problematic case. It's problematic specifically because it
involves one xact queuing a change to *some other* xact -- we cannot assume
that this won't happen between WAL-logging within heap_insert() and
heap_delete(). Can you think of any other such cases?

I think that REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID should be fine.
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID is not an issue either. In both
cases I believe my assumption does not break, because there won't be writes
to system catalogs between the relevant heap_insert() and heap_delete()
calls, which are tightly coupled, and also because speculative insertion
into catalogs is unsupported. That just leaves the non-*CHANGE_INTERNAL_*
(regular DML) cases, which should also be fine.

As for SnapBuildDistributeNewCatalogSnapshot(), I'm open to suggestions. Do
you see any opportunity around assuming that heavyweight locking prevents
the interleaving of a REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT change between
a REORDER_BUFFER_CHANGE_INTERNAL_INSERT change and a
REORDER_BUFFER_CHANGE_INTERNAL_DELETE change? AFAICT, that interleaving is
all this approach really needs to worry about (or the interleaving of
something else not under the control of speculative insertion -- it doesn't
have to be a REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT change; that's just the
most obvious problematic case).

>> Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case
>> next, in the case that we need to care about (when the tuple was super
>> deleted immediately afterwards)?
>
> It's irrelevant whether you care about it or not. Your
> ReorderBufferIterTXNNext() consumes a change that needs to actually be
> processed. It could very well be something important like a new
> snapshot.
But it is actually processed, in the next iteration (we avoid calling
ReorderBufferIterTXNNext() at the top of the loop if it has already been
called for that iteration as part of peeking ahead). AFAICT, all that is at
issue is the safety of one particular assumption I've made: that if no super
deletion change immediately follows a speculative insertion change within
some xact during decoding, there will never be a super deletion of that
tuple at all. If the peeked-at change is something else, it's still going to
be processed. The implementation will, however, fail to consolidate the
speculative insertion change and super deletion change if my assumption is
ever faulty.

This whole approach doesn't seem to be all that different from how there is
currently coordination within ReorderBufferCommit() between TOAST-related
changes and changes to the relation proper. In fact, I've now duplicated the
way the IsToastRelation() case performs "dlist_delete(&change->node)" in
order to avoid chunk reuse in the event of spilling to disk. Stress testing
by decreasing "max_changes_in_memory" to 1 does not exhibit broken behavior,
I believe (although that does break the test_decoding regression tests on
the master branch, FWIW). Also, obviously I have not considered the
SnapBuildDistributeNewCatalogSnapshot() case too closely.

I attach an updated patch that I believe at least handles the serialization
aspects correctly, FYI. Note that I assert that a
REORDER_BUFFER_CHANGE_INTERNAL_DELETE follows a
REORDER_BUFFER_CHANGE_INTERNAL_INSERT, so if you can find a way to break the
assumption it should cause an assertion failure, which is something to look
out for during testing.

>> While what I have here *is* very ugly, I see no better alternative. By
>> doing what you suggest, we'd be finding a special case way of safely
>> violating the general "WAL log-before-data" rule.
>
> No, it won't. We don't WAL log modifications that don't need to persist
> in a bunch of places.
> It'd be perfectly fine to start upsert with a
> 'acquiring a insertion lock' record that pretty much only contains the
> item pointer (and a FPI if necessary to prevent torn pages). And then
> only log the full new record once it's sure that the whole thing needs
> to survive. Redo than can simply clean out the size touched by the
> insertion lock.

That seems like a lot of work in the general case, all to avoid something
that will only very rarely need to occur anyway. The optimistic approach of
value locking scheme #2 has a small race that is detected (which
necessitates super deletion), but super deletion will only very rarely be
required. Also, there is value in making super deletion (and speculative
insertion) as close as possible to regular deletion and insertion.

--
Peter Geoghegan
From c445308ead8aa99419820769fda8bbccc6eeb980 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 22 Feb 2015 17:02:14 -0800
Subject: [PATCH 4/8] Logical decoding support for ON CONFLICT UPDATE

Transaction commit reassembly now consolidates speculative insertion
related changes.  Using a "peek-ahead", it is determined if a
speculative insertion went on to be super deleted.  If that is not the
case, then it is reported as an ordinary insertion to decoding plugins.
---
 src/backend/replication/logical/decode.c        | 11 +++-
 src/backend/replication/logical/reorderbuffer.c | 83 ++++++++++++++++++++++++-
 src/include/replication/reorderbuffer.h         | 11 +++-
 3 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e7614bd..3c05866 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -591,7 +591,11 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_INSERT;
+	if (!(xlrec->flags & XLOG_HEAP_SPECULATIVE_TUPLE))
+		change->action = REORDER_BUFFER_CHANGE_INSERT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_INSERT;
+
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
 	if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -682,7 +686,10 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+	if (!(xlrec->flags & XLOG_HEAP_SPECULATIVE_TUPLE))
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+	else
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_DELETE;
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 20bb3b7..e8ad2ee 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -401,6 +401,8 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INSERT:
 		case REORDER_BUFFER_CHANGE_UPDATE:
 		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			if (change->data.tp.newtuple)
 			{
 				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
@@ -1314,6 +1316,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
+		ReorderBufferChange *peekchange = NULL;
 
 		if (using_subtxn)
 			BeginInternalSubTransaction("replay");
@@ -1323,16 +1326,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		rb->begin(rb, txn);
 
 		iterstate = ReorderBufferIterTXNInit(rb, txn);
-		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
+		while ((change = peekchange ? peekchange :
+				ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/* Forget about previous peek ahead */
+			if (peekchange)
+				peekchange = NULL;
+			else
+				Assert(change->action !=
+					   REORDER_BUFFER_CHANGE_INTERNAL_DELETE);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INSERT:
 				case REORDER_BUFFER_CHANGE_UPDATE:
 				case REORDER_BUFFER_CHANGE_DELETE:
+				case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+				case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 					Assert(snapshot_now);
 
 					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
@@ -1374,7 +1387,65 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					else if (!IsToastRelation(relation))
 					{
 						ReorderBufferToastReplace(rb, txn, relation, change);
-						rb->apply_change(rb, txn, relation, change);
+
+						/*
+						 * Kludge: Speculative insertion occasionally
+						 * makes use of "super deletion" -- an
+						 * implementation defined delete of a speculatively
+						 * inserted tuple.  Neither the super deletion, nor
+						 * the insertion (which must be the prior record
+						 * type) are included in the final assembly when
+						 * the tuple was super-deleted.  Otherwise, an
+						 * ordinary insertion is assembled.
+						 */
+						if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_INSERT)
+						{
+							/*
+							 * Need to ensure the memory used by promise
+							 * tuple isn't freed till we're done verifying
+							 * that there is no super deletion that
+							 * immediately follows.  Otherwise it could get
+							 * freed/reused while restoring spooled data
+							 * from disk.
+							 */
+							dlist_delete(&change->node);
+							peekchange = ReorderBufferIterTXNNext(rb, iterstate);
+
+							if (!peekchange || peekchange->action !=
+								REORDER_BUFFER_CHANGE_INTERNAL_DELETE)
+							{
+								/* Report as proper insert to client */
+								change->action = REORDER_BUFFER_CHANGE_INSERT;
+								rb->apply_change(rb, txn, relation,
+												 change);
+							}
+							else if (peekchange)
+							{
+								Assert(RelFileNodeEquals(change->data.tp.relnode,
+														 peekchange->data.tp.relnode));
+							}
+
+							ReorderBufferReturnChange(rb, change);
+						}
+						else if (change->action ==
+								 REORDER_BUFFER_CHANGE_INTERNAL_DELETE)
+						{
+							/*
+							 * The REORDER_BUFFER_CHANGE_INTERNAL_INSERT
+							 * case makes an assumption that
+							 * REORDER_BUFFER_CHANGE_INTERNAL_DELETE
+							 * changes immediately follows reliably iff a
+							 * speculatively inserted tuple was actually
+							 * super-deleted.
+							 */
+						}
+						else
+						{
+							/*
+							 * Handle non-speculative insertion related
+							 * changes
+							 */
+							rb->apply_change(rb, txn, relation, change);
+						}
 
 						/*
 						 * Only clear reassembled toast chunks if we're
@@ -2003,6 +2074,10 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_UPDATE:
 			/* fall through */
 		case REORDER_BUFFER_CHANGE_DELETE:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			{
 				char	   *data;
 				ReorderBufferTupleBuf *oldtup,
@@ -2258,6 +2333,10 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_UPDATE:
 			/* fall through */
 		case REORDER_BUFFER_CHANGE_DELETE:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			if (change->data.tp.newtuple)
 			{
 				Size		len = offsetof(ReorderBufferTupleBuf, t_data) +
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f1e0f57..d694981 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -43,6 +43,13 @@ typedef struct ReorderBufferTupleBuf
  * and ComboCids in the same list with the user visible INSERT/UPDATE/DELETE
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
+ *
+ * The REORDER_BUFFER_CHANGE_INTERNAL_INSERT and
+ * REORDER_BUFFER_CHANGE_INTERNAL_DELETE changes concern "super deletion",
+ * which is a mechanism that speculative insertion makes use of to handle
+ * conflicts.  At transaction reassembly these will be consolidated, and so
+ * decoding plugins will only ever handle REORDER_BUFFER_CHANGE_INSERT changes
+ * here too (in the common case where speculative insertion works out).
  */
 enum ReorderBufferChangeType
 {
@@ -51,7 +58,9 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
-	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID
+	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
+	REORDER_BUFFER_CHANGE_INTERNAL_INSERT,
+	REORDER_BUFFER_CHANGE_INTERNAL_DELETE
};

/*
-- 
1.9.1
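For anyone wanting to reason about the consolidation in isolation: here is a
minimal, self-contained sketch of the peek-ahead idea, using hypothetical
simplified types (ChangeType, next_change(), replay() are all stand-ins I
made up for illustration; the real code iterates with
ReorderBufferIterTXNNext() and applies changes via rb->apply_change(), and
also has to handle serialization, TOAST, and memory lifetime, none of which
is modeled here):

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the reorder buffer change kinds */
typedef enum
{
	CHANGE_INSERT = 0,			/* ordinary insertion */
	CHANGE_INTERNAL_INSERT,		/* speculative insertion */
	CHANGE_INTERNAL_DELETE		/* super deletion */
} ChangeType;

#define NO_CHANGE	(-1)		/* iterator exhausted */
#define NO_PEEK		(-2)		/* no peeked-at change pending */

/* Stand-in for ReorderBufferIterTXNNext() over a flat array of changes */
static int
next_change(const ChangeType *changes, int nchanges, int *pos)
{
	return *pos < nchanges ? (int) changes[(*pos)++] : NO_CHANGE;
}

/*
 * Replay a transaction's changes, consolidating each speculative insertion
 * with an immediately following super deletion (both are swallowed); a
 * speculative insertion with no such follow-up is reported as an ordinary
 * insert.  Returns the number of changes reported to the imaginary output
 * plugin.
 */
static int
replay(const ChangeType *changes, int nchanges)
{
	int			pos = 0;
	int			applied = 0;
	int			change;
	int			peek = NO_PEEK;

	while ((change = (peek != NO_PEEK ? peek :
					  next_change(changes, nchanges, &pos))) != NO_CHANGE)
	{
		peek = NO_PEEK;			/* forget about previous peek ahead */

		if (change == CHANGE_INTERNAL_INSERT)
		{
			/* Peek ahead: was this speculative insertion super deleted? */
			peek = next_change(changes, nchanges, &pos);
			if (peek != CHANGE_INTERNAL_DELETE)
				applied++;		/* report as a proper insert to client */
			else
				peek = NO_PEEK;	/* consolidated away: skip both changes */
		}
		else
			applied++;
	}
	return applied;
}
```

Note that a change consumed by the peek but not consolidated is not lost: it
is picked up at the top of the next iteration, which mirrors how the patch
avoids calling ReorderBufferIterTXNNext() again when a peeked change is
already in hand.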
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers