On Tue, Mar 3, 2015 at 3:13 AM, Andres Freund <and...@2ndquadrant.com> wrote:
>> toast_save_datum() makes its heap_insert() call before heap insertion
>> for the tuple proper. We're relying on the assumption that if there is
>> no immediate super deletion record, things are fine. We cannot
>> speculatively insert into catalogs, BTW.
>
> I fail to see what prohibits, e.g., another catalog-modifying
> transaction from committing in between and distributing a new snapshot.

SnapBuildDistributeNewCatalogSnapshot()/REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT
does look like a problematic case. It's problematic specifically
because it involves some xact queuing a change to *some other* xact -
we cannot assume that this won't happen between WAL-logging within
heap_insert() and heap_delete(). Can you think of any other such
cases?
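
To make that hazard concrete, here is a throwaway standalone sketch
(the enum, names and streams are all invented for this email; they
only mimic the real ReorderBufferChangeType values) of how a naive
"the next change must be the super deletion" test is defeated by a
snapshot change that another xact queued into ours, e.g. via
SnapBuildDistributeNewCatalogSnapshot():

#include <stdio.h>

enum change {SPEC_INSERT, SPEC_DELETE, SNAPSHOT};

static const char *names[] = {"SPEC_INSERT", "SPEC_DELETE", "SNAPSHOT"};

static void
check(const enum change *stream, int len)
{
	int			i;

	for (i = 0; i < len; i++)
	{
		if (stream[i] != SPEC_INSERT)
			continue;
		if (i + 1 < len && stream[i + 1] == SPEC_DELETE)
			printf("position %d: consolidated away\n", i);
		else
			printf("position %d: next is %s -> reported as a plain INSERT\n",
				   i, i + 1 < len ? names[stream[i + 1]] : "(end)");
	}
}

int
main(void)
{
	/* the assumption holds: the super deletion directly follows */
	enum change ok[] = {SPEC_INSERT, SPEC_DELETE};

	/*
	 * another xact's snapshot change interleaves: the tuple *was* super
	 * deleted, but the naive test reports a surviving insertion
	 */
	enum change bad[] = {SPEC_INSERT, SNAPSHOT, SPEC_DELETE};

	check(ok, 2);
	check(bad, 3);
	return 0;
}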

I think that REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID should be fine.
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID is not an issue either. In
both cases I believe my assumption does not break, because there won't
be writes to system catalogs between the relevant heap_insert() and
heap_delete() calls, which are tightly coupled, and also because
speculative insertion into catalogs is unsupported. That just leaves
the non-*CHANGE_INTERNAL_* (regular DML) cases, which should also be
fine.

As for SnapBuildDistributeNewCatalogSnapshot(), I'm open to
suggestions. Do you see any opportunity to make assumptions about
heavyweight locking that would rule out the interleaving of a
REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT change between a
REORDER_BUFFER_CHANGE_INTERNAL_INSERT change and a
REORDER_BUFFER_CHANGE_INTERNAL_DELETE change? AFAICT, that's all this
approach really needs to worry about (that, or the interleaving of
something else not under the control of speculative insertion - it
doesn't have to be a REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT change;
that's just the most obvious problematic case).

>> Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case
>> next, in the case that we need to care about (when the tuple was super
>> deleted immediately afterwards)?
>
> It's irrelevant whether you care about it or not. Your
> ReorderBufferIterTXNNext() consumes a change that needs to actually be
> processed. It could very well be something important like a new
> snapshot.

But it is actually processed, in the next iteration (we avoid calling
ReorderBufferIterTXNNext() at the top of the loop if it has already
been called for that iteration as part of peeking ahead). AFAICT, all
that is at issue is the safety of one particular assumption I've made:
that when decoding, if a speculative insertion change within some xact
is not immediately followed by a super deletion change, then no super
deletion ever occurred. If the peeked-at change is something else, it
is still processed. The implementation will, however, fail to
consolidate the speculative insertion change and the super deletion
change if that assumption is ever faulty.
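
For reference, a minimal standalone sketch of that control flow, under
the same invented names as before (next_change() stands in for
ReorderBufferIterTXNNext(); the stream is made up). The point is that
a peeked-at change that turns out not to be a super deletion is
processed normally on the following iteration:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum action {INSERT, SPEC_INSERT, SPEC_DELETE, SNAPSHOT};

static enum action stream[] = {SPEC_INSERT, SPEC_DELETE, INSERT,
							   SPEC_INSERT, SNAPSHOT};
static size_t pos = 0;

static enum action *
next_change(void)
{
	if (pos >= sizeof(stream) / sizeof(stream[0]))
		return NULL;
	return &stream[pos++];
}

int
main(void)
{
	enum action *change;
	enum action *peekchange = NULL;

	while ((change = peekchange ? peekchange : next_change()) != NULL)
	{
		/* forget about previous peek ahead */
		if (peekchange)
			peekchange = NULL;
		else
			assert(*change != SPEC_DELETE);	/* mirrors the patch's Assert */

		if (*change == SPEC_INSERT)
		{
			peekchange = next_change();
			if (!peekchange || *peekchange != SPEC_DELETE)
				printf("report speculative insertion as a plain INSERT\n");
			/* a peeked SPEC_DELETE is consumed (as a no-op) next iteration */
		}
		else if (*change == SPEC_DELETE)
		{
			/* only reachable via the peek ahead; nothing to apply */
		}
		else
			printf("apply ordinary change %d\n", (int) *change);
	}
	return 0;
}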

This whole approach doesn't seem all that different from how
ReorderBufferCommit() currently coordinates TOAST-related changes with
changes to the relation proper. In fact, I've now duplicated the way
the IsToastRelation() case performs "dlist_delete(&change->node)" in
order to avoid chunk reuse in the event of spilling to disk. Stress
testing with "max_changes_in_memory" decreased to 1 does not exhibit
broken behavior, I believe (although that setting does break the
test_decoding regression tests on the master branch, FWIW). Also,
admittedly I have not considered the
SnapBuildDistributeNewCatalogSnapshot() case too closely.
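
To illustrate what that dlist_delete() buys us, a toy sketch (the list
type here is invented, not PostgreSQL's dlist, and the cleanup loop
stands in for the iterator freeing/reusing spilled changes): once the
change is unlinked from the list, a later list-wide cleanup can no
longer free or recycle it out from under us:

#include <stdio.h>
#include <stdlib.h>

struct node
{
	struct node *prev;
	struct node *next;
	int			payload;
};

static void
unlink_node(struct node *n)		/* analogous to dlist_delete(&change->node) */
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int
main(void)
{
	struct node head = {&head, &head, 0};
	struct node *kept;
	int			i;

	for (i = 1; i <= 3; i++)
	{
		struct node *n = malloc(sizeof(*n));

		n->payload = i;
		n->prev = head.prev;
		n->next = &head;
		head.prev->next = n;
		head.prev = n;
	}

	/* keep the first change alive across a list-wide cleanup */
	kept = head.next;
	unlink_node(kept);

	/* cleanup frees everything still on the list... */
	for (struct node *n = head.next; n != &head;)
	{
		struct node *next = n->next;

		free(n);
		n = next;
	}

	/* ...but the unlinked change is still valid here */
	printf("kept payload: %d\n", kept->payload);
	free(kept);
	return 0;
}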

I attach an updated patch that I believe at least handles the
serialization aspects correctly, FYI. Note that I assert that a
REORDER_BUFFER_CHANGE_INTERNAL_DELETE change only ever immediately
follows a REORDER_BUFFER_CHANGE_INTERNAL_INSERT change, so if you can
find a way to break the assumption it should cause an assertion
failure, which is something to look out for during testing.

>> While what I have here *is* very ugly, I see no better alternative. By
>> doing what you suggest, we'd be finding a special case way of safely
>> violating the general "WAL log-before-data" rule.
>
> No, it won't. We don't WAL-log modifications that don't need to persist
> in a bunch of places. It'd be perfectly fine to start upsert with an
> 'acquiring an insertion lock' record that pretty much only contains the
> item pointer (and an FPI if necessary to prevent torn pages). And then
> only log the full new record once it's sure that the whole thing needs
> to survive.  Redo then can simply clean out the space touched by the
> insertion lock.

That seems like a lot of work in the general case to avoid something
that will only very rarely need to occur anyway. The optimistic
approach of value locking scheme #2 has a small race that is detected
(which necessitates super deletion), but super deletion will only very
rarely be required. Also, there is value in making super deletion (and
speculative insertion) as close as possible to regular deletion and
insertion.

-- 
Peter Geoghegan
From c445308ead8aa99419820769fda8bbccc6eeb980 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 22 Feb 2015 17:02:14 -0800
Subject: [PATCH 4/8] Logical decoding support for ON CONFLICT UPDATE

Transaction commit reassembly now consolidates changes related to
speculative insertion.  Using a "peek ahead", it is determined whether
a speculative insertion went on to be super deleted.  If not, it is
reported to decoding plugins as an ordinary insertion.
---
 src/backend/replication/logical/decode.c        | 11 +++-
 src/backend/replication/logical/reorderbuffer.c | 83 ++++++++++++++++++++++++-
 src/include/replication/reorderbuffer.h         | 11 +++-
 3 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e7614bd..3c05866 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -591,7 +591,11 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_INSERT;
+	if (!(xlrec->flags & XLOG_HEAP_SPECULATIVE_TUPLE))
+		change->action = REORDER_BUFFER_CHANGE_INSERT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_INSERT;
+
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
 	if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
@@ -682,7 +686,10 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+	if (!(xlrec->flags & XLOG_HEAP_SPECULATIVE_TUPLE))
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+	else
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_DELETE;
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 20bb3b7..e8ad2ee 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -401,6 +401,8 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 		case REORDER_BUFFER_CHANGE_INSERT:
 		case REORDER_BUFFER_CHANGE_UPDATE:
 		case REORDER_BUFFER_CHANGE_DELETE:
+		case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			if (change->data.tp.newtuple)
 			{
 				ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
@@ -1314,6 +1316,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 	PG_TRY();
 	{
 		ReorderBufferChange *change;
+		ReorderBufferChange *peekchange = NULL;
 
 		if (using_subtxn)
 			BeginInternalSubTransaction("replay");
@@ -1323,16 +1326,26 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 		rb->begin(rb, txn);
 
 		iterstate = ReorderBufferIterTXNInit(rb, txn);
-		while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
+		while ((change = peekchange ? peekchange :
+				ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
 		{
 			Relation	relation = NULL;
 			Oid			reloid;
 
+			/* Forget about previous peek ahead */
+			if (peekchange)
+				peekchange = NULL;
+			else
+				Assert(change->action !=
+					   REORDER_BUFFER_CHANGE_INTERNAL_DELETE);
+
 			switch (change->action)
 			{
 				case REORDER_BUFFER_CHANGE_INSERT:
 				case REORDER_BUFFER_CHANGE_UPDATE:
 				case REORDER_BUFFER_CHANGE_DELETE:
+				case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+				case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 					Assert(snapshot_now);
 
 					reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
@@ -1374,7 +1387,65 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						else if (!IsToastRelation(relation))
 						{
 							ReorderBufferToastReplace(rb, txn, relation, change);
-							rb->apply_change(rb, txn, relation, change);
+
+							/*
+							 * Kludge:  Speculative insertion occasionally
+							 * makes use of "super deletion" -- an
+							 * implementation-defined delete of a
+							 * speculatively inserted tuple.  Neither the
+							 * super deletion nor the insertion (which must be
+							 * the prior record type) is included in the final
+							 * assembly when the tuple was super deleted.
+							 * Otherwise, an ordinary insertion is assembled.
+							 */
+							if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_INSERT)
+							{
+								/*
+								 * Need to ensure the memory used by the
+								 * promise tuple isn't freed till we're done
+								 * verifying that there is no super deletion
+								 * that immediately follows.  Otherwise it
+								 * could get freed/reused while restoring
+								 * spooled data from disk.
+								 */
+								dlist_delete(&change->node);
+								peekchange = ReorderBufferIterTXNNext(rb, iterstate);
+								if (!peekchange || peekchange->action !=
+									REORDER_BUFFER_CHANGE_INTERNAL_DELETE)
+								{
+									/* Report as proper insert to client */
+									change->action = REORDER_BUFFER_CHANGE_INSERT;
+									rb->apply_change(rb, txn, relation,
+													 change);
+								}
+								else if (peekchange)
+								{
+									Assert(RelFileNodeEquals(change->data.tp.relnode,
+															 peekchange->data.tp.relnode));
+								}
+
+								ReorderBufferReturnChange(rb, change);
+							}
+							else if (change->action ==
+									 REORDER_BUFFER_CHANGE_INTERNAL_DELETE)
+							{
+								/*
+								 * The REORDER_BUFFER_CHANGE_INTERNAL_INSERT
+								 * case relies on the assumption that a
+								 * REORDER_BUFFER_CHANGE_INTERNAL_DELETE
+								 * change reliably follows immediately iff a
+								 * speculatively inserted tuple was actually
+								 * super deleted.
+								 */
+							}
+							else
+							{
+								/*
+								 * Handle non-speculative insertion related
+								 * changes
+								 */
+								rb->apply_change(rb, txn, relation, change);
+							}
 
 							/*
 							 * Only clear reassembled toast chunks if we're
@@ -2003,6 +2074,10 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_UPDATE:
 			/* fall through */
 		case REORDER_BUFFER_CHANGE_DELETE:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			{
 				char	   *data;
 				ReorderBufferTupleBuf *oldtup,
@@ -2258,6 +2333,10 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		case REORDER_BUFFER_CHANGE_UPDATE:
 			/* fall through */
 		case REORDER_BUFFER_CHANGE_DELETE:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+			/* fall through */
+		case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
 			if (change->data.tp.newtuple)
 			{
 				Size		len = offsetof(ReorderBufferTupleBuf, t_data) +
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f1e0f57..d694981 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -43,6 +43,13 @@ typedef struct ReorderBufferTupleBuf
  * and ComboCids in the same list with the user visible INSERT/UPDATE/DELETE
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
+ *
+ * The REORDER_BUFFER_CHANGE_INTERNAL_INSERT and
+ * REORDER_BUFFER_CHANGE_INTERNAL_DELETE changes concern "super deletion", a
+ * mechanism that speculative insertion uses to handle conflicts.  At
+ * transaction reassembly these are consolidated, so decoding plugins will
+ * only ever see REORDER_BUFFER_CHANGE_INSERT changes for these too (in the
+ * common case, where speculative insertion works out).
  */
 enum ReorderBufferChangeType
 {
@@ -51,7 +58,9 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_DELETE,
 	REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
-	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID
+	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
+	REORDER_BUFFER_CHANGE_INTERNAL_INSERT,
+	REORDER_BUFFER_CHANGE_INTERNAL_DELETE
 };
 
 /*
-- 
1.9.1
