Let me explain a bit more what's the mechanism at play.

When scanning a block range to summarize it, we use an MVCC snapshot.
Any tuples that are inserted by transactions that appear as in-progress
to that snapshot will not be seen by that range scan; therefore we need
another mechanism to include their values in the summary tuple.

This mechanism is a "placeholder tuple."  We insert this tuple into the
index; any process that tries to insert into the range will update the
placeholder tuple.  All insertions that are not visible according to the
MVCC scan must update the placeholder tuple.

So there are two events that are time-critical regarding each other: the
insertion of the placeholder tuple, and the acquisition of the snapshot
for the MVCC scan.  All transactions in progress for the snapshot *must*
see the placeholder tuple; therefore, we make sure we insert the
placeholder tuple first, and acquire the snapshot later.

This is all regardless of a transaction being a user-declared
transaction block, or an implicit transaction opened by some command;
regardless of vacuum being user-invoked or autovacuum.  The description
of the above is critical for correctness in all those cases.

I assumed (wrongly) that vacuum would not acquire a snapshot.  (For one
thing, we already rely for correctness on lazy vacuum not holding xmin
back, so this is not entirely without sense.)  The problem is that
PortalRunUtility does it inconditionally.  This is not a problem in
read-committed mode, because we simply freshen the snapshot each time
and all is well.  In transaction-snapshot mode, taking another snapshot
is a no-op and thus the whole thing is broken.  The code is right to
raise an error.  (I don't really give a damn whether the error message
is 100% correct or not.  Fixing that part is trivial.)

Even assuming that PortalRunUtility would not grab a snapshot at start,
things would still be broken if vacuum processed more than one range in
transaction-snapshot isolation mode: insert placeholder tuple, grab
snapshot for the first range, scan range: all is well up to this point.
Then we need to process the second range: insert placeholder tuple, grab
snapshot ... which reuses the previous snapshot, older than the
placeholder tuple.  Oops.

I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
as I can see, that should completely close the hole.  This requires
patching IndexBuildHeapRangeScan() to allow for that.

Untested patch attached.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 360b26e..9883c36 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -951,9 +951,15 @@ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
 	 * Execute the partial heap scan covering the heap blocks in the specified
 	 * page range, summarizing the heap tuples in it.  This scan stops just
 	 * short of brinbuildCallback creating the new index entry.
+	 *
+	 * Note that it is critical we use the "latest snapshot" mode of
+	 * IndexBuildHeapRangeScan here: otherwise, in transaction-snapshot
+	 * isolation mode the snapshot used would be earlier than the insertion of
+	 * the placeholder tuple, leading to values inserted concurrently not
+	 * being considered in the summary info of the block range.
 	 */
 	state->bs_currRangeStart = heapBlk;
-	IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false,
+	IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
 							heapBlk, state->bs_pagesPerRange,
 							brinbuildCallback, (void *) state);
 
@@ -1066,28 +1072,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
 				 * additional precautions to avoid losing the additional
 				 * tuples; see comments in summarize_range.  Set the
 				 * concurrent flag, which causes IndexBuildHeapRangeScan to
-				 * use a snapshot other than SnapshotAny, and silences
+				 * use LatestSnapshot rather than SnapshotAny, and silences
 				 * warnings emitted there.
 				 */
 				indexInfo->ii_Concurrent = true;
-
-				/*
-				 * If using transaction-snapshot mode, it would be possible
-				 * for another transaction to insert a tuple that's not
-				 * visible to our snapshot if we have already acquired one,
-				 * when in snapshot-isolation mode; therefore, disallow this
-				 * from running in such a transaction unless a snapshot hasn't
-				 * been acquired yet.
-				 *
-				 * This code is called by VACUUM and
-				 * brin_summarize_new_values. Have the error message mention
-				 * the latter because VACUUM cannot run in a transaction and
-				 * thus cannot cause this issue.
-				 */
-				if (IsolationUsesXactSnapshot() && FirstSnapshotSet)
-					ereport(ERROR,
-							(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-							 errmsg("brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot")));
 			}
 			summarize_range(indexInfo, state, heapRel, heapBlk);
 
@@ -1111,7 +1099,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
 	/* free resources */
 	brinRevmapTerminate(revmap);
 	if (state)
+	{
 		terminate_brin_buildstate(state);
+		pfree(indexInfo);
+	}
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69f35c9..2390e36 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2161,6 +2161,7 @@ IndexBuildHeapScan(Relation heapRelation,
 {
 	return IndexBuildHeapRangeScan(heapRelation, indexRelation,
 								   indexInfo, allow_sync,
+								   false,
 								   0, InvalidBlockNumber,
 								   callback, callback_state);
 }
@@ -2170,12 +2171,17 @@ IndexBuildHeapScan(Relation heapRelation,
  * number of blocks are scanned.  Scan to end-of-rel can be signalled by
  * passing InvalidBlockNumber as numblocks.  Note that restricting the range
  * to scan cannot be done when requesting syncscan.
+ *
+ * If use_latest_snapshot, the snapshot used is not the transaction snapshot,
+ * but obtained using GetLatestSnapshot().  (Note this only affects scans
+ * marked ii_Concurrent).
  */
 double
 IndexBuildHeapRangeScan(Relation heapRelation,
 						Relation indexRelation,
 						IndexInfo *indexInfo,
 						bool allow_sync,
+						bool use_latest_snapshot,
 						BlockNumber start_blockno,
 						BlockNumber numblocks,
 						IndexBuildCallback callback,
@@ -2234,7 +2240,9 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 	 */
 	if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		snapshot = use_latest_snapshot ?
+			GetLatestSnapshot() : GetTransactionSnapshot();
+		snapshot = RegisterSnapshot(snapshot);
 		OldestXmin = InvalidTransactionId;		/* not used */
 	}
 	else
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 8b3b28d..4ad5ac3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -105,6 +105,7 @@ extern double IndexBuildHeapRangeScan(Relation heapRelation,
 						Relation indexRelation,
 						IndexInfo *indexInfo,
 						bool allow_sync,
+						bool use_latest_snapshot,
 						BlockNumber start_blockno,
 						BlockNumber end_blockno,
 						IndexBuildCallback callback,
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to