On 02.11.2010 16:40, Heikki Linnakangas wrote:
On 02.11.2010 16:30, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakan...@enterprisedb.com> writes:
I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

You mean after CreateCheckPoint has determined the redo pointer, but
before it has written the checkpoint record? The new split can go ahead,
and the checkpoint doesn't need care about it. Recovery will start at
the redo pointer, so it will see the split record, and will know to
finish the incomplete split if necessary.

The logic is the same as with inCommit. Checkpoint will fetch the list
of in-progress splits some time after determining the redo-pointer. It
will then wait until all of those splits have finished. Any new splits
that begin after fetching the list don't affect the checkpoint.

inCommit can't be used as is, because it's tied to the Xid, but
something similar should work.

Here's a first draft of this, using the inCommit flag as is. It works, but suffers from starvation if you have a lot of concurrent multi-WAL-record actions. I tested that by running INSERTs to a table with tsvector field with a GiST index on it from five concurrent sessions, and saw checkpoints regularly busy-waiting for over a minute.

To avoid that, we need something a little bit more complicated than a boolean flag. I'm thinking of adding a counter beside the inCommit flag that's incremented every time a new multi-WAL-record action begins, so that the checkpoint process can distinguish between a new action that was started after deciding the REDO pointer and an old one that's still running.

(inCommit is a misnomer now, of course. Will need to find a better name..)

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 070cd92..ec95f89 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -17,6 +17,7 @@
 #include "access/gin.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/proc.h"
 #include "utils/rel.h"
 
 /*
@@ -281,6 +282,7 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 	Page		page,
 				rpage,
 				lpage;
+	bool		splitInProgress = false;
 
 	/* remember root BlockNumber */
 	while (parent)
@@ -318,7 +320,7 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 
 			freeGinBtreeStack(stack);
 
-			return;
+			break; /* don't need to recurse to parent */
 		}
 		else
 		{
@@ -326,6 +328,20 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 			Page		newlpage;
 
 			/*
+			 * Mark ourselves as within a "checkpoint critical section". This
+			 * forces any concurrent checkpoint to wait until we've inserted the
+			 * parent pointer too. Without this, it is possible for the checkpoint to
+			 * set REDO after the split WAL record and quickly finish the checkpoint
+			 * before the insertion of the parent pointer has been WAL-logged. Replay
+			 * from such a checkpoint would not know that the split is incomplete.
+			 */
+			if (!!btree->index->rd_istemp)
+			{
+				splitInProgress = true;
+				MyProc->inCommit = true;
+			}
+
+			/*
 			 * newlpage is a pointer to memory page, it doesn't associate with
 			 * buffer, stack->buffer should be untouched
 			 */
@@ -402,7 +418,7 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 						buildStats->nEntryPages++;
 				}
 
-				return;
+				break; /* don't need to recurse to parent */
 			}
 			else
 			{
@@ -472,4 +488,11 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 		pfree(stack);
 		stack = parent;
 	}
+
+	/*
+	 * If we had to split a page, let checkpointer know that we're finished
+	 * now.
+	 */
+	if (splitInProgress)
+		MyProc->inCommit = false;
 }
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 9fd6167..a774a67 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -866,11 +866,3 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	incomplete_splits = NIL;
 }
-
-bool
-gin_safe_restartpoint(void)
-{
-	if (incomplete_splits)
-		return false;
-	return true;
-}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3054f98..4d4d964 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/indexfsm.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 
 const XLogRecPtr XLogRecPtrForTemp = {1, 1};
@@ -272,12 +273,27 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 	state.r = r;
 	state.key = itup->t_tid;
 	state.needInsertComplete = true;
+	/*
+	 * Mark ourselves as within a "checkpoint critical section". This
+	 * forces any concurrent checkpoint to wait until we've completed the
+	 * insert. Without this, it is possible for the checkpoint to set REDO
+	 * in the middle of a multi-record insert sequence, and quickly finish
+	 * the checkpoint before the next related page update record has been
+	 * WAL-logged. Replay from such a checkpoint would not see any of the
+	 * GiST WAL records, and would therefore not know that the operation
+	 * is incomplete.
+	 */
+	if (!r->rd_istemp)
+		MyProc->inCommit = true;
 
 	state.stack = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
 	state.stack->blkno = GIST_ROOT_BLKNO;
 
 	gistfindleaf(&state, giststate);
 	gistmakedeal(&state, giststate);
+
+	if (!r->rd_istemp)
+		MyProc->inCommit = false;
 }
 
 static bool
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index a90303e..7fdbe77 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -838,14 +838,6 @@ gist_xlog_cleanup(void)
 	MemoryContextDelete(insertCtx);
 }
 
-bool
-gist_safe_restartpoint(void)
-{
-	if (incomplete_inserts)
-		return false;
-	return true;
-}
-
 
 XLogRecData *
 formSplitRdata(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index eaad812..eb510bb 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -21,6 +21,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
+#include "storage/proc.h"
 #include "utils/inval.h"
 #include "utils/tqual.h"
 
@@ -63,7 +64,8 @@ static void _bt_insertonpg(Relation rel, Buffer buf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
-			   bool split_only_page);
+			   bool split_only_page,
+			   bool isleaf);
 static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz,
 		  IndexTuple newitem, bool newitemonleft);
@@ -176,7 +178,7 @@ top:
 	{
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
-		_bt_insertonpg(rel, buf, stack, itup, offset, false);
+		_bt_insertonpg(rel, buf, stack, itup, offset, false, true);
 	}
 	else
 	{
@@ -660,7 +662,8 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
-			   bool split_only_page)
+			   bool split_only_page,
+			   bool isleaf)
 {
 	Page		page;
 	BTPageOpaque lpageop;
@@ -714,6 +717,10 @@ _bt_insertonpg(Relation rel,
 		 *----------
 		 */
 		_bt_insert_parent(rel, buf, rbuf, stack, is_root, is_only);
+
+		/* Finish the "checkpoint critical section" started in _bt_split() */
+		if (!rel->rd_istemp && isleaf)
+			MyProc->inCommit = false;
 	}
 	else
 	{
@@ -1139,6 +1146,17 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 	START_CRIT_SECTION();
 
 	/*
+	 * Mark ourselves as within a "checkpoint critical section". This
+	 * forces any concurrent checkpoint to wait until we've inserted the
+	 * parent pointer too. Without this, it is possible for the checkpoint to
+	 * set REDO after the split WAL record and quickly finish the checkpoint
+	 * before the insertion of the parent pointer has been WAL-logged. Replay
+	 * from such a checkpoint would not know that the split is incomplete.
+	 */
+	if (!rel->rd_istemp)
+		MyProc->inCommit = true;
+
+	/*
 	 * By here, the original data page has been split into two new halves, and
 	 * these are correct.  The algorithm requires that the left page never
 	 * move during a split, so we copy the new left page back on top of the
@@ -1683,7 +1701,7 @@ _bt_insert_parent(Relation rel,
 		/* Recursively update the parent */
 		_bt_insertonpg(rel, pbuf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
-					   is_only);
+					   is_only, false);
 
 		/* be tidy */
 		pfree(new_item);
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0822f5c..bbe01e1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -1257,11 +1257,3 @@ btree_xlog_cleanup(void)
 	}
 	incomplete_actions = NIL;
 }
-
-bool
-btree_safe_restartpoint(void)
-{
-	if (incomplete_actions)
-		return false;
-	return true;
-}
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eaac139..c72b534 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -542,6 +542,19 @@ replay code has to do the insertion on its own to restore the index to
 consistency.  Such insertions occur after WAL is operational, so they can
 and should write WAL records for the additional generated actions.
 
+Finishing incomplete actions at the end of recovery only works if the recovery
+sees the WAL record starting the action. Because of that, you need to ensure
+that a checkpoint doesn't happen in the middle of such an operation. To be
+more precise, a checkpoint mustn't begin and finish so that there a
+multi-record WAL action is in progress, but no WAL record part of the action
+lies between the REDO pointer and the checkpoint record. Recovery from such a
+checkpoint would not see any evidence of the still incomplete multi-record
+action, so it would not know that it needs to be finished at the end of
+recovery. To prevent that, you must set MyProc->inCommit before writing the
+first WAL record of the operation, and clear it after the operation is
+finished. The checkpoint code checks that after deciding the REDO pointer,
+waiting until it sees that the inCommit flag of every backend is clear before
+writing the checkpoint record.
 
 Write-Ahead Logging for Filesystem Actions
 ------------------------------------------
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index c706e97..c3fcfb4 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -26,20 +26,20 @@
 
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
-	{"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},
-	{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
-	{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
-	{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
-	{"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},
-	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
-	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
-	{"RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL},
-	{"Standby", standby_redo, standby_desc, NULL, NULL, NULL},
-	{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
-	{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
-	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
-	{"Hash", hash_redo, hash_desc, NULL, NULL, NULL},
-	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint},
-	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, gist_safe_restartpoint},
-	{"Sequence", seq_redo, seq_desc, NULL, NULL, NULL}
+	{"XLOG", xlog_redo, xlog_desc, NULL, NULL},
+	{"Transaction", xact_redo, xact_desc, NULL, NULL},
+	{"Storage", smgr_redo, smgr_desc, NULL, NULL},
+	{"CLOG", clog_redo, clog_desc, NULL, NULL},
+	{"Database", dbase_redo, dbase_desc, NULL, NULL},
+	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL},
+	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL},
+	{"RelMap", relmap_redo, relmap_desc, NULL, NULL},
+	{"Standby", standby_redo, standby_desc, NULL, NULL},
+	{"Heap2", heap2_redo, heap2_desc, NULL, NULL},
+	{"Heap", heap_redo, heap_desc, NULL, NULL},
+	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup},
+	{"Hash", hash_redo, hash_desc, NULL, NULL},
+	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup},
+	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup},
+	{"Sequence", seq_redo, seq_desc, NULL, NULL}
 };
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 786d0c6..15091af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7302,8 +7302,11 @@ CreateCheckPoint(int flags)
 	nInCommit = GetTransactionsInCommit(&inCommitXids);
 	if (nInCommit > 0)
 	{
+		int i = 0;
 		do
 		{
+			if ((++i) % 10 == 0)
+				elog(LOG, "waiting, %d try", i);
 			pg_usleep(10000L);	/* wait for 10 msec */
 		} while (HaveTransactionsInCommit(inCommitXids, nInCommit));
 	}
@@ -7554,31 +7557,10 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 static void
 RecoveryRestartPoint(const CheckPoint *checkPoint)
 {
-	int			rmid;
-
 	/* use volatile pointer to prevent code rearrangement */
 	volatile XLogCtlData *xlogctl = XLogCtl;
 
 	/*
-	 * Is it safe to checkpoint?  We must ask each of the resource managers
-	 * whether they have any partial state information that might prevent a
-	 * correct restart from this point.  If so, we skip this opportunity, but
-	 * return at the next checkpoint record for another try.
-	 */
-	for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
-	{
-		if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
-			if (!(RmgrTable[rmid].rm_safe_restartpoint()))
-			{
-				elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X",
-					 rmid,
-					 checkPoint->redo.xlogid,
-					 checkPoint->redo.xrecoff);
-				return;
-			}
-	}
-
-	/*
 	 * Copy the checkpoint record to shared memory, so that bgwriter can use
 	 * it the next time it wants to perform a restartpoint.
 	 */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e2d7b45..b5c62df 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -400,7 +400,6 @@ extern void gin_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void gin_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
-extern bool gin_safe_restartpoint(void);
 
 /* ginbtree.c */
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 34cc5d5..9f0b18c 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -252,7 +252,6 @@ extern void gist_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void gist_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
-extern bool gist_safe_restartpoint(void);
 extern IndexTuple gist_form_invalid_tuple(BlockNumber blkno);
 
 extern XLogRecData *formUpdateRdata(RelFileNode node, Buffer buffer,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 3bbc4d1..4fdc25d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -647,6 +647,5 @@ extern void btree_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern void btree_xlog_startup(void);
 extern void btree_xlog_cleanup(void);
-extern bool btree_safe_restartpoint(void);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 370c989..866b0d2 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -250,7 +250,6 @@ typedef struct RmgrData
 	void		(*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
-	bool		(*rm_safe_restartpoint) (void);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to