On 22.10.2013 19:55, Heikki Linnakangas wrote:
I fixed the the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate "the downlink hasn't been inserted
yet". When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.

This is what I came up with.

One thing I'm not totally happy about is the way page deletions of incompletely split pages are handled. Basically, it just bails out and refuses to delete a page that is part of an incomplete split. That's probably OK in practice, as incomplete splits should be very rare anyway, but it's a bit dissatisfying to not handle the case because at first glance it seems like it should be even simpler than usual to delete a page that has no downlink. Nevertheless, I decided to just skip that for now.

After this patch, deleting the only child of a parent and the parent itself is still a multi-WAL-record operation that needs to be tracked during recovery, and completed at the end of recovery. I'd like to eliminate that too, but that's another patch.

- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..29d8bd1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -384,12 +384,41 @@ an additional insertion above that, etc).
 For a root split, the followon WAL entry is a "new root" entry rather than
 an "insertion" entry, but details are otherwise much the same.
 
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course).  This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because insertion involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks along the way, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra writes in what would otherwise
+be a read-only operation (it would not be possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is atomically cleared.
+The child page is kept locked until the insertion in the parent is finished
+and the flag in the child cleared, but can be released immediately after
+that, before recursing up the tree, if the parent also needs to be split.
+That ensures that incompletely split pages should not be seen under normal
+circumstances; only when a transaction fails to insert the parent for some
+reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, beacuse it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
+
+We used to keep track of incomplete splits during recovery and finish them
+immediately at end of recovery, instead of doing it lazily at the next
+insertion. However, that made the recovery much more complicated, and only
+fixed the problem when crash recovery was performed. An incomplete split can
+also occur if an otherwise recoverable error, like out-of-memory or
+out-of-disk-space, happens while inserting the downlink to the parent.
 
 When splitting a non-root page that is alone on its level, the required
 metapage update (of the "fast root" link) is performed and logged as part
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a452fea..a810015 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -59,14 +59,16 @@ static void _bt_findinsertloc(Relation rel,
 				  ScanKey scankey,
 				  IndexTuple newtup,
 				  Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz,
+static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
 		  IndexTuple newitem, bool newitemonleft);
+static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
+				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
@@ -130,7 +132,8 @@ top:
 	 * move right in the tree.	See Lehman and Yao for an excruciatingly
 	 * precise description.
 	 */
-	buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
+	buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
+						true, stack, BT_WRITE);
 
 	/*
 	 * If we're not allowing duplicates, make sure the key isn't already in
@@ -184,7 +187,7 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
-		_bt_insertonpg(rel, buf, stack, itup, offset, false);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
 	}
 	else
 	{
@@ -664,6 +667,10 @@ _bt_findinsertloc(Relation rel,
  *		insertion, and the buffer must be pinned and write-locked.	On return,
  *		we will have dropped both the pin and the lock on the buffer.
  *
+ *		When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
+ *		page we're inserting the downlink for. This function will clear the
+ *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
  *		The locking interactions in this code are critical.  You should
  *		grok Lehman and Yao's paper before making any changes.  In addition,
  *		you need to understand how we disambiguate duplicate keys in this
@@ -677,6 +684,7 @@ _bt_findinsertloc(Relation rel,
 static void
 _bt_insertonpg(Relation rel,
 			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
@@ -690,6 +698,11 @@ _bt_insertonpg(Relation rel,
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/* the caller should've finished any incomplete splits already */
+	if (P_INCOMPLETE_SPLIT(lpageop))
+		elog(ERROR, "cannot insert to incompletely-split page %u",
+			 BufferGetBlockNumber(buf));
+
 	itemsz = IndexTupleDSize(*itup);
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
@@ -714,7 +727,7 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, firstright,
+		rbuf = _bt_split(rel, buf, cbuf, firstright,
 						 newitemoff, itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
@@ -788,15 +801,25 @@ _bt_insertonpg(Relation rel,
 			MarkBufferDirty(metabuf);
 		}
 
+		/* clear INCOMPLETE_SPLIT flag on child if this finishes a split */
+		if (!P_ISLEAF(lpageop))
+		{
+			Page		cpage = BufferGetPage(cbuf);
+			BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+			Assert(P_INCOMPLETE_SPLIT(cpageop));
+			cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+			MarkBufferDirty(cbuf);
+		}
+
 		/* XLOG stuff */
 		if (RelationNeedsWAL(rel))
 		{
 			xl_btree_insert xlrec;
-			BlockNumber xldownlink;
+			BlockNumber xlleftchild;
 			xl_btree_metadata xlmeta;
 			uint8		xlinfo;
 			XLogRecPtr	recptr;
-			XLogRecData rdata[4];
+			XLogRecData rdata[5];
 			XLogRecData *nextrdata;
 			IndexTupleData trunctuple;
 
@@ -812,12 +835,15 @@ _bt_insertonpg(Relation rel,
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
 			else
 			{
-				xldownlink = ItemPointerGetBlockNumber(&(itup->t_tid));
-				Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
-
-				nextrdata->data = (char *) &xldownlink;
+				/*
+				 * Include the block number of the left child, whose
+				 * INCOMPLETE_SPLIT flag is cleared.
+				 */
+				xlleftchild = BufferGetBlockNumber(cbuf);
+				nextrdata->data = (char *) &xlleftchild;
 				nextrdata->len = sizeof(BlockNumber);
-				nextrdata->buffer = InvalidBuffer;
+				nextrdata->buffer = cbuf;
+				nextrdata->buffer_std = true;
 				nextrdata->next = nextrdata + 1;
 				nextrdata++;
 
@@ -870,6 +896,8 @@ _bt_insertonpg(Relation rel,
 		END_CRIT_SECTION();
 
 		/* release buffers; send out relcache inval if metapage changed */
+		if (!P_ISLEAF(lpageop))
+			_bt_relbuf(rel, cbuf);
 		if (BufferIsValid(metabuf))
 		{
 			if (!InRecovery)
@@ -889,11 +917,15 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
+ *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
+ *		page we're inserting the downlink for. This function will clear the
+ *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
 		  bool newitemonleft)
 {
@@ -961,6 +993,8 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 	lopaque->btpo_flags = oopaque->btpo_flags;
 	lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
 	ropaque->btpo_flags = lopaque->btpo_flags;
+	/* set flag in left page indicating that the right page has no downlink */
+	lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
 	lopaque->btpo_prev = oopaque->btpo_prev;
 	lopaque->btpo_next = rightpagenumber;
 	ropaque->btpo_prev = origpagenumber;
@@ -1184,6 +1218,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 		MarkBufferDirty(sbuf);
 	}
 
+	/*
+	 * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
+	 * a split.
+	 */
+	if (ropaque->btpo.level > 0)
+	{
+		Page		cpage = BufferGetPage(cbuf);
+		BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+		cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+		MarkBufferDirty(cbuf);
+	}
+
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
@@ -1206,34 +1252,6 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 
 		lastrdata = &rdata[0];
 
-		if (ropaque->btpo.level > 0)
-		{
-			/* Log downlink on non-leaf pages */
-			lastrdata->next = lastrdata + 1;
-			lastrdata++;
-
-			lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
-			lastrdata->len = sizeof(BlockIdData);
-			lastrdata->buffer = InvalidBuffer;
-
-			/*
-			 * We must also log the left page's high key, because the right
-			 * page's leftmost key is suppressed on non-leaf levels.  Show it
-			 * as belonging to the left page buffer, so that it is not stored
-			 * if XLogInsert decides it needs a full-page image of the left
-			 * page.
-			 */
-			lastrdata->next = lastrdata + 1;
-			lastrdata++;
-
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			lastrdata->data = (char *) item;
-			lastrdata->len = MAXALIGN(IndexTupleSize(item));
-			lastrdata->buffer = buf;	/* backup block 1 */
-			lastrdata->buffer_std = true;
-		}
-
 		/*
 		 * Log the new item and its offset, if it was inserted on the left
 		 * page. (If it was put on the right page, we don't need to explicitly
@@ -1260,17 +1278,40 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 			lastrdata->buffer = buf;	/* backup block 1 */
 			lastrdata->buffer_std = true;
 		}
-		else if (ropaque->btpo.level == 0)
+
+		/* Log left page */
+		if (ropaque->btpo.level > 0)
 		{
+			lastrdata->next = lastrdata + 1;
+			lastrdata++;
+
 			/*
-			 * Although we don't need to WAL-log the new item, we still need
-			 * XLogInsert to consider storing a full-page image of the left
-			 * page, so make an empty entry referencing that buffer. This also
-			 * ensures that the left page is always backup block 1.
+			 * We must also log the left page's high key, because the right
+			 * page's leftmost key is suppressed on non-leaf levels.  Show it
+			 * as belonging to the left page buffer, so that it is not stored
+			 * if XLogInsert decides it needs a full-page image of the left
+			 * page.
 			 */
+			itemid = PageGetItemId(origpage, P_HIKEY);
+			item = (IndexTuple) PageGetItem(origpage, itemid);
+			lastrdata->data = (char *) item;
+			lastrdata->len = MAXALIGN(IndexTupleSize(item));
+			lastrdata->buffer = buf;	/* backup block 1 */
+			lastrdata->buffer_std = true;
+		}
+
+		if (ropaque->btpo.level == 0 && !newitemonleft)
+		{
 			lastrdata->next = lastrdata + 1;
 			lastrdata++;
 
+			/*
+			 * Although we don't need to WAL-log anything on the left page,
+			 * the new item, we still need XLogInsert to consider storing a
+			 * full-page image of the left page, so make an empty entry
+			 * referencing that buffer. This also ensures that the left page
+			 * is always backup block 1.
+			 */
 			lastrdata->data = NULL;
 			lastrdata->len = 0;
 			lastrdata->buffer = buf;	/* backup block 1 */
@@ -1278,6 +1319,22 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 		}
 
 		/*
+		 * Log block number of left child, whose INCOMPLETE_SPLIT flag this
+		 * insertion clears.
+		 */
+		if (ropaque->btpo.level > 0)
+		{
+			BlockNumber cblkno = BufferGetBlockNumber(cbuf);
+			lastrdata->next = lastrdata + 1;
+			lastrdata++;
+
+			lastrdata->data = (char *) &cblkno;
+			lastrdata->len = sizeof(BlockNumber);
+			lastrdata->buffer = cbuf;	/* backup block 2 */
+			lastrdata->buffer_std = true;
+		}
+
+		/*
 		 * Log the contents of the right page in the format understood by
 		 * _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
 		 * because we're going to recreate the whole page anyway, so it should
@@ -1306,7 +1363,7 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 
 			lastrdata->data = NULL;
 			lastrdata->len = 0;
-			lastrdata->buffer = sbuf;	/* backup block 2 */
+			lastrdata->buffer = sbuf;	/* bkp block 2 (leaf) or 3 (non-leaf) */
 			lastrdata->buffer_std = true;
 		}
 
@@ -1333,6 +1390,10 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 	if (!P_RIGHTMOST(ropaque))
 		_bt_relbuf(rel, sbuf);
 
+	/* release the child */
+	if (ropaque->btpo.level > 0)
+		_bt_relbuf(rel, cbuf);
+
 	/* split's done */
 	return rbuf;
 }
@@ -1603,10 +1664,8 @@ _bt_checksplitloc(FindSplitData *state,
  *			have to be efficient (concurrent ROOT split, WAL recovery)
  * is_root - we split the true root
  * is_only - we split a page alone on its level (might have been fast root)
- *
- * This is exported so it can be called by nbtxlog.c.
  */
-void
+static void
 _bt_insert_parent(Relation rel,
 				  Buffer buf,
 				  Buffer rbuf,
@@ -1685,12 +1744,13 @@ _bt_insert_parent(Relation rel,
 		 * 05/27/97
 		 */
 		ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
-
 		pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
 
-		/* Now we can unlock the children */
+		/*
+		 * Now we can unlock the right child. The left child will be unlocked
+		 * by _bt_insertonpg().
+		 */
 		_bt_relbuf(rel, rbuf);
-		_bt_relbuf(rel, buf);
 
 		/* Check for error only after writing children */
 		if (pbuf == InvalidBuffer)
@@ -1698,7 +1758,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, stack->bts_parent,
+		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -1708,6 +1768,56 @@ _bt_insert_parent(Relation rel,
 }
 
 /*
+ * _bt_finish_split() -- Finish an incomplete split
+ *
+ * 'lbuf' is the left buffer whose right sibling is missing the downlink.
+ * On entry, we hold write-lock on it, and the lock is released on exit.
+ */
+void
+_bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
+{
+	Page		lpage = BufferGetPage(lbuf);
+	BTPageOpaque lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+	Buffer		rbuf;
+	Page		rpage;
+	BTPageOpaque rpageop;
+	bool		was_root;
+	bool		was_only;
+
+	/* Lock the right sibling, the one missing the downlink */
+	rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+	rpage = BufferGetPage(rbuf);
+	rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+	/* Could this be a root split? */
+	if (!stack)
+	{
+		Buffer		metabuf;
+		Page		metapg;
+		BTMetaPageData *metad;
+
+		/* acquire lock on the metapage */
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+		metapg = BufferGetPage(metabuf);
+		metad = BTPageGetMeta(metapg);
+
+		was_root = (metad->btm_root == BufferGetBlockNumber(lbuf));
+
+		_bt_relbuf(rel, metabuf);
+	}
+	else
+		was_root = false;
+
+	/* Was this the only page on the level before split? */
+	was_only = (P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop));
+
+	elog(DEBUG1, "finishing incomplete split of %u/%u",
+		 BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
+
+	_bt_insert_parent(rel, lbuf, rbuf, stack, was_root, was_only);
+}
+
+/*
  *	_bt_getstackbuf() -- Walk back up the tree one step, and find the item
  *						 we last looked at in the parent.
  *
@@ -1843,6 +1953,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 				rbkno;
 	BlockNumber rootblknum;
 	BTPageOpaque rootopaque;
+	BTPageOpaque lopaque;
 	ItemId		itemid;
 	IndexTuple	item;
 	Size		itemsz;
@@ -1854,6 +1965,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
 	lpage = BufferGetPage(lbuf);
+	lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 
 	/* get a new root page */
 	rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1927,6 +2039,11 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 			 BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
 	pfree(new_item);
 
+	/* Clear the incomplete-split flag in the left child */
+	Assert(P_INCOMPLETE_SPLIT(lopaque));
+	lopaque->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+	MarkBufferDirty(lbuf);
+
 	MarkBufferDirty(rootbuf);
 	MarkBufferDirty(metabuf);
 
@@ -1935,7 +2052,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	{
 		xl_btree_newroot xlrec;
 		XLogRecPtr	recptr;
-		XLogRecData rdata[2];
+		XLogRecData rdata[3];
 
 		xlrec.node = rel->rd_node;
 		xlrec.rootblk = rootblknum;
@@ -1954,7 +2071,13 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		rdata[1].len = ((PageHeader) rootpage)->pd_special -
 			((PageHeader) rootpage)->pd_upper;
 		rdata[1].buffer = InvalidBuffer;
-		rdata[1].next = NULL;
+		rdata[1].next = &(rdata[2]);
+
+		/* Make a full-page image of the left child if needed */
+		rdata[2].data = NULL;
+		rdata[2].len = 0;
+		rdata[2].buffer = lbuf;
+		rdata[2].next = NULL;
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f407753..cbbfc1d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -992,8 +992,15 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
 	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
 	pbuf = _bt_getstackbuf(rel, stack, BT_READ);
 	if (pbuf == InvalidBuffer)
-		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
+	{
+		/*
+		 * If the target page's left sibling was incompletely split, it
+		 * has no downlink in the parent. Bail out.
+		 */
+		elog(DEBUG1, "failed to re-find parent key in index \"%s\" for deletion target page %u",
 			 RelationGetRelationName(rel), target);
+		return false;
+	}
 	parent = stack->bts_blkno;
 	poffset = stack->bts_offset;
 
@@ -1094,11 +1101,16 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	/*
 	 * We can never delete rightmost pages nor root pages.	While at it, check
 	 * that page is not already deleted and is empty.
+	 *
+	 * Also don't delete a page that was incompletely split. We could handle
+	 * that case, but it would complicate the logic, and incomplete splits are
+	 * rare.
 	 */
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
-		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+		P_INCOMPLETE_SPLIT(opaque))
 	{
 		/* Should never fail to delete a half-dead page */
 		Assert(!P_ISHALFDEAD(opaque));
@@ -1223,6 +1235,19 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			page = BufferGetPage(lbuf);
 			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 		}
+
+		/*
+		 * If the left sibling was incompletely split, and therefore the target
+		 * page has no downlink in the parent, give up. We could handle this
+		 * case - in fact deleting a page that has no downlink would probably
+		 * be simpler than the general case, as the parent doesn't need to be
+		 * updated. But incomplete splits are rare.
+		 */
+		if (P_INCOMPLETE_SPLIT(opaque))
+		{
+			_bt_relbuf(rel, lbuf);
+			return 0;
+		}
 	}
 	else
 		lbuf = InvalidBuffer;
@@ -1242,7 +1267,8 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	 * didn't have it locked; the others are just for paranoia's sake.
 	 */
 	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
-		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+		P_INCOMPLETE_SPLIT(opaque))
 	{
 		_bt_relbuf(rel, buf);
 		if (BufferIsValid(lbuf))
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ac98589..a85b5ee 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -51,7 +51,9 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
  * NOTE that the returned buffer is read-locked regardless of the access
  * parameter.  However, access = BT_WRITE will allow an empty root page
  * to be created and returned.	When access = BT_READ, an empty index
- * will result in *bufP being set to InvalidBuffer.
+ * will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
+ * any incomplete splits encountered during the search will be finished by
+ * inserting the missing downlinks.
  */
 BTStack
 _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
@@ -82,8 +84,13 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Race -- the page we just grabbed may have split since we read its
 		 * pointer in the parent (or metapage).  If it has, we may need to
 		 * move right to its new sibling.  Do that.
+		 *
+		 * In write-mode, allow _bt_moveright to finish any incomplete splits
+		 * along the way.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey, BT_READ);
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+							  (access == BT_WRITE), stack_in,
+							  BT_READ);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -148,6 +155,11 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * item >= scankey.  When nextkey is true, we are looking for the first
  * item strictly greater than scankey.
  *
+ * If finishsplits is true, we will attempt to finish any incompletely split
+ * pages that we encounter. This is required when searching for a target page
+ * for an insertion, because the split algorithm will fail if it cannot find
+ * the downlink of a page. 'stack' is only used if finishsplits is true.
+ *
  * On entry, we have the buffer pinned and a lock of the type specified by
  * 'access'.  If we move right, we release the buffer and lock and acquire
  * the same on the right sibling.  Return value is the buffer we stop at.
@@ -158,11 +170,14 @@ _bt_moveright(Relation rel,
 			  int keysz,
 			  ScanKey scankey,
 			  bool nextkey,
+			  bool finishsplits,
+			  BTStack stack,
 			  int access)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	int32		cmpval;
+	int			curaccess = access;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -186,12 +201,41 @@ _bt_moveright(Relation rel,
 
 	while (!P_RIGHTMOST(opaque) &&
 		   (P_IGNORE(opaque) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
+			_bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval ||
+			(finishsplits && P_INCOMPLETE_SPLIT(opaque))))
 	{
-		/* step right one page */
-		BlockNumber rblkno = opaque->btpo_next;
+		BlockNumber blkno;
+		int			needaccess = access;
+
+		/*
+		 * Finish any incomplete splits we encounter along the way.
+		 */
+		if (finishsplits && P_INCOMPLETE_SPLIT(opaque))
+		{
+			blkno = BufferGetBlockNumber(buf);
+			if (curaccess == BT_READ)
+			{
+				/* re-acquire in write-mode, and recheck */
+				needaccess = BT_WRITE;
+			}
+			else
+			{
+				_bt_finish_split(rel, buf, stack);
+				/*
+				 * the buffer was released by _bt_finish_split, we will
+				 * re-acquire it in the right mode below.
+				 */
+				buf = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/* step right one page */
+			blkno = opaque->btpo_next;
+		}
 
-		buf = _bt_relandgetbuf(rel, buf, rblkno, access);
+		buf = _bt_relandgetbuf(rel, buf, blkno, needaccess);
+		curaccess = needaccess;
 		page = BufferGetPage(buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index cb5867e..c934a5e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,77 +21,29 @@
 #include "miscadmin.h"
 
 /*
- * We must keep track of expected insertions due to page splits, and apply
- * them manually if they are not seen in the WAL log during replay.  This
- * makes it safe for page insertion to be a multiple-WAL-action process.
- *
- * Similarly, deletion of an only child page and deletion of its parent page
+ * Deletion of an only child page and deletion of its parent page
  * form multiple WAL log entries, and we have to be prepared to follow through
- * with the deletion if the log ends between.
+ * with the deletion if the log ends between. Keep track of those actions,
  *
  * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split or multi deletion to remain incomplete
+ * since we don't expect a multi deletion to remain incomplete
  * for long.  In any case we need to respect the order of operations.
  */
 typedef struct bt_incomplete_action
 {
 	RelFileNode node;			/* the index */
-	bool		is_split;		/* T = pending split, F = pending delete */
-	/* these fields are for a split: */
-	bool		is_root;		/* we split the root */
-	BlockNumber leftblk;		/* left half of split */
-	BlockNumber rightblk;		/* right half of split */
 	/* these fields are for a delete: */
 	BlockNumber delblk;			/* parent block to be deleted */
 } bt_incomplete_action;
 
 static List *incomplete_actions;
 
-
-static void
-log_incomplete_split(RelFileNode node, BlockNumber leftblk,
-					 BlockNumber rightblk, bool is_root)
-{
-	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
-
-	action->node = node;
-	action->is_split = true;
-	action->is_root = is_root;
-	action->leftblk = leftblk;
-	action->rightblk = rightblk;
-	incomplete_actions = lappend(incomplete_actions, action);
-}
-
-static void
-forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
-{
-	ListCell   *l;
-
-	foreach(l, incomplete_actions)
-	{
-		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
-		if (RelFileNodeEquals(node, action->node) &&
-			action->is_split &&
-			downlink == action->rightblk)
-		{
-			if (is_root != action->is_root)
-				elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
-					 action->is_root, is_root);
-			incomplete_actions = list_delete_ptr(incomplete_actions, action);
-			pfree(action);
-			break;				/* need not look further */
-		}
-	}
-}
-
 static void
 log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
 {
 	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
 
 	action->node = node;
-	action->is_split = false;
 	action->delblk = delblk;
 	incomplete_actions = lappend(incomplete_actions, action);
 }
@@ -106,7 +58,6 @@ forget_matching_deletion(RelFileNode node, BlockNumber delblk)
 		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
 
 		if (RelFileNodeEquals(node, action->node) &&
-			!action->is_split &&
 			delblk == action->delblk)
 		{
 			incomplete_actions = list_delete_ptr(incomplete_actions, action);
@@ -190,23 +141,60 @@ _bt_restore_meta(RelFileNode rnode, XLogRecPtr lsn,
 	UnlockReleaseBuffer(metabuf);
 }
 
+/*
+ * _bt_clear_incomplete_split -- clear INCOMPLETE_SPLIT flag on a page
+ *
+ * This is a common subroutine of the redo functions of all the WAL record
+ * types that can insert a downlink: insert, split, and newroot.
+ *
+ * Returns a (locked) buffer containing the child page, or InvalidBuffer if
+ * the page was not found (because it was truncated later).
+ */
+static Buffer
+_bt_clear_incomplete_split(XLogRecPtr lsn, XLogRecord *record,
+						   RelFileNode rnode, BlockNumber cblock)
+{
+	Buffer cbuf;
+
+	cbuf = XLogReadBuffer(rnode, cblock, false);
+	if (BufferIsValid(cbuf))
+	{
+		Page		page = (Page) BufferGetPage(cbuf);
+
+		if (lsn > PageGetLSN(page))
+		{
+			BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+			Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+			pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(cbuf);
+		}
+	}
+
+	return cbuf;
+}
+
 static void
 btree_xlog_insert(bool isleaf, bool ismeta,
 				  XLogRecPtr lsn, XLogRecord *record)
 {
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
 	Buffer		buffer;
+	Buffer		cbuffer = InvalidBuffer;
 	Page		page;
 	char	   *datapos;
 	int			datalen;
 	xl_btree_metadata md;
-	BlockNumber downlink = 0;
+	BlockNumber finishes_split = 0;
+	int			blk_index;
 
 	datapos = (char *) xlrec + SizeOfBtreeInsert;
 	datalen = record->xl_len - SizeOfBtreeInsert;
-	if (!isleaf)
+	if (!isleaf && (record->xl_info & XLR_BKP_BLOCK(0)) == 0)
 	{
-		memcpy(&downlink, datapos, sizeof(BlockNumber));
+		memcpy(&finishes_split, datapos, sizeof(BlockNumber));
+		Assert(finishes_split != 0);
 		datapos += sizeof(BlockNumber);
 		datalen -= sizeof(BlockNumber);
 	}
@@ -217,8 +205,18 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 		datalen -= sizeof(xl_btree_metadata);
 	}
 
-	if (record->xl_info & XLR_BKP_BLOCK(0))
-		(void) RestoreBackupBlock(lsn, record, 0, false, false);
+	if (!isleaf)
+	{
+		if (record->xl_info & XLR_BKP_BLOCK(0))
+			cbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
+		else
+			cbuffer = _bt_clear_incomplete_split(lsn, record, xlrec->target.node, finishes_split);
+	}
+
+	/* the backup block index containing the main page is 0 or 1 */
+	blk_index = isleaf ? 0 : 1;
+	if (record->xl_info & XLR_BKP_BLOCK(blk_index))
+		(void) RestoreBackupBlock(lsn, record, blk_index, false, false);
 	else
 	{
 		buffer = XLogReadBuffer(xlrec->target.node,
@@ -228,11 +226,7 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 		{
 			page = (Page) BufferGetPage(buffer);
 
-			if (lsn <= PageGetLSN(page))
-			{
-				UnlockReleaseBuffer(buffer);
-			}
-			else
+			if (lsn > PageGetLSN(page))
 			{
 				if (PageAddItem(page, (Item) datapos, datalen,
 							ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
@@ -241,11 +235,14 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 
 				PageSetLSN(page, lsn);
 				MarkBufferDirty(buffer);
-				UnlockReleaseBuffer(buffer);
 			}
+			UnlockReleaseBuffer(buffer);
 		}
 	}
 
+	if (BufferIsValid(cbuffer))
+		UnlockReleaseBuffer(cbuffer);
+
 	/*
 	 * Note: in normal operation, we'd update the metapage while still holding
 	 * lock on the page we inserted into.  But during replay it's not
@@ -257,10 +254,6 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 		_bt_restore_meta(xlrec->target.node, lsn,
 						 md.root, md.level,
 						 md.fastroot, md.fastlevel);
-
-	/* Forget any split this insertion completes */
-	if (!isleaf)
-		forget_matching_split(xlrec->target.node, downlink, false);
 }
 
 static void
@@ -268,7 +261,10 @@ btree_xlog_split(bool onleft, bool isroot,
 				 XLogRecPtr lsn, XLogRecord *record)
 {
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
+	bool		isleaf = xlrec->level == 0;
+	Buffer		lbuf;
 	Buffer		rbuf;
+	Buffer		cbuf = InvalidBuffer;
 	Page		rpage;
 	BTPageOpaque ropaque;
 	char	   *datapos;
@@ -278,42 +274,19 @@ btree_xlog_split(bool onleft, bool isroot,
 	Size		newitemsz = 0;
 	Item		left_hikey = NULL;
 	Size		left_hikeysz = 0;
+	BlockNumber finishes_split = InvalidBlockNumber;
 
 	datapos = (char *) xlrec + SizeOfBtreeSplit;
 	datalen = record->xl_len - SizeOfBtreeSplit;
 
-	/* Forget any split this insertion completes */
-	if (xlrec->level > 0)
-	{
-		/* we assume SizeOfBtreeSplit is at least 16-bit aligned */
-		BlockNumber downlink = BlockIdGetBlockNumber((BlockId) datapos);
-
-		datapos += sizeof(BlockIdData);
-		datalen -= sizeof(BlockIdData);
-
-		forget_matching_split(xlrec->node, downlink, false);
-
-		/* Extract left hikey and its size (still assuming 16-bit alignment) */
-		if (!(record->xl_info & XLR_BKP_BLOCK(0)))
-		{
-			/* We assume 16-bit alignment is enough for IndexTupleSize */
-			left_hikey = (Item) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
-	}
-
-	/* Extract newitem and newitemoff, if present */
+	/* Extract newitemoff and newitem, if present */
 	if (onleft)
 	{
-		/* Extract the offset (still assuming 16-bit alignment) */
+		/* we assume SizeOfBtreeSplit is at least 16-bit aligned */
 		memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
 		datapos += sizeof(OffsetNumber);
 		datalen -= sizeof(OffsetNumber);
 	}
-
 	if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
 	{
 		/*
@@ -327,6 +300,37 @@ btree_xlog_split(bool onleft, bool isroot,
 		datalen -= newitemsz;
 	}
 
+	/* Extract left hikey and its size (still assuming 16-bit alignment) */
+	if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(0)))
+	{
+		left_hikey = (Item) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
+	}
+	/*
+	 * If this insertion finishes an incomplete split, get the block number
+	 * of the child.
+	 */
+	if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(1)))
+	{
+		memcpy(&finishes_split, datapos, sizeof(BlockNumber));
+		datapos += sizeof(BlockNumber);
+		datalen -= sizeof(BlockNumber);
+	}
+
+	/*
+	 * Clear the incomplete split flag on the left sibling of the child page
+	 * this is a downlink for.
+	 */
+	if (!isleaf)
+	{
+		if (record->xl_info & XLR_BKP_BLOCK(2))
+			cbuf = RestoreBackupBlock(lsn, record, 2, false, true);
+		else
+			cbuf = _bt_clear_incomplete_split(lsn, record, xlrec->node, finishes_split);
+	}
+
 	/* Reconstruct right (new) sibling page from scratch */
 	rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
 	Assert(BufferIsValid(rbuf));
@@ -338,7 +342,7 @@ btree_xlog_split(bool onleft, bool isroot,
 	ropaque->btpo_prev = xlrec->leftsib;
 	ropaque->btpo_next = xlrec->rnext;
 	ropaque->btpo.level = xlrec->level;
-	ropaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+	ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
 	ropaque->btpo_cycleid = 0;
 
 	_bt_restore_page(rpage, datapos, datalen);
@@ -347,7 +351,7 @@ btree_xlog_split(bool onleft, bool isroot,
 	 * On leaf level, the high key of the left page is equal to the first key
 	 * on the right page.
 	 */
-	if (xlrec->level == 0)
+	if (isleaf)
 	{
 		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
 
@@ -362,10 +366,10 @@ btree_xlog_split(bool onleft, bool isroot,
 
 	/* Now reconstruct left (original) sibling page */
 	if (record->xl_info & XLR_BKP_BLOCK(0))
-		(void) RestoreBackupBlock(lsn, record, 0, false, false);
+		lbuf = RestoreBackupBlock(lsn, record, 0, false, true);
 	else
 	{
-		Buffer		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
+		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
 
 		if (BufferIsValid(lbuf))
 		{
@@ -422,19 +426,22 @@ btree_xlog_split(bool onleft, bool isroot,
 					elog(PANIC, "failed to add high key to left page after split");
 
 				/* Fix opaque fields */
-				lopaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+				lopaque->btpo_flags = isleaf ? BTP_LEAF : 0;
+				lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
 				lopaque->btpo_next = xlrec->rightsib;
 				lopaque->btpo_cycleid = 0;
 
 				PageSetLSN(lpage, lsn);
 				MarkBufferDirty(lbuf);
 			}
-
-			UnlockReleaseBuffer(lbuf);
 		}
 	}
 
-	/* We no longer need the right buffer */
+	/* We no longer need the buffers */
+	if (BufferIsValid(cbuf))
+		UnlockReleaseBuffer(cbuf);
+	if (BufferIsValid(lbuf))
+		UnlockReleaseBuffer(lbuf);
 	UnlockReleaseBuffer(rbuf);
 
 	/*
@@ -445,32 +452,39 @@ btree_xlog_split(bool onleft, bool isroot,
 	 * replay, because no other index update can be in progress, and readers
 	 * will cope properly when following an obsolete left-link.
 	 */
-	if (record->xl_info & XLR_BKP_BLOCK(1))
-		(void) RestoreBackupBlock(lsn, record, 1, false, false);
-	else if (xlrec->rnext != P_NONE)
+	if (xlrec->rnext != P_NONE)
 	{
-		Buffer		buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+		/*
+		 * the backup block containing right sibling is 2 or 3, depending
+		 * whether this was a leaf or internal page.
+		 */
+		int		rnext_index = isleaf ? 2 : 3;
 
-		if (BufferIsValid(buffer))
+		if (record->xl_info & XLR_BKP_BLOCK(rnext_index))
+			(void) RestoreBackupBlock(lsn, record, rnext_index, false, false);
+		else
 		{
-			Page		page = (Page) BufferGetPage(buffer);
+			Buffer		buffer;
 
-			if (lsn > PageGetLSN(page))
+			buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+
+			if (BufferIsValid(buffer))
 			{
-				BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				Page		page = (Page) BufferGetPage(buffer);
 
-				pageop->btpo_prev = xlrec->rightsib;
+				if (lsn > PageGetLSN(page))
+				{
+					BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-				PageSetLSN(page, lsn);
-				MarkBufferDirty(buffer);
+					pageop->btpo_prev = xlrec->rightsib;
+
+					PageSetLSN(page, lsn);
+					MarkBufferDirty(buffer);
+				}
+				UnlockReleaseBuffer(buffer);
 			}
-			UnlockReleaseBuffer(buffer);
 		}
 	}
-
-	/* The job ain't done till the parent link is inserted... */
-	log_incomplete_split(xlrec->node,
-						 xlrec->leftsib, xlrec->rightsib, isroot);
 }
 
 static void
@@ -946,12 +960,11 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
 {
 	xl_btree_newroot *xlrec = (xl_btree_newroot *) XLogRecGetData(record);
 	Buffer		buffer;
+	Buffer		cbuffer = InvalidBuffer;
 	Page		page;
 	BTPageOpaque pageop;
-	BlockNumber downlink = 0;
-
-	/* Backup blocks are not used in newroot records */
-	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+	BlockNumber cblkno;
+	IndexTuple	itup;
 
 	buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
 	Assert(BufferIsValid(buffer));
@@ -969,28 +982,31 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
 
 	if (record->xl_len > SizeOfBtreeNewroot)
 	{
-		IndexTuple	itup;
-
 		_bt_restore_page(page,
 						 (char *) xlrec + SizeOfBtreeNewroot,
 						 record->xl_len - SizeOfBtreeNewroot);
-		/* extract downlink to the right-hand split page */
-		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
-		downlink = ItemPointerGetBlockNumber(&(itup->t_tid));
+		/* extract block number of the left-hand split page */
+		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+		cblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
 		Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+
+		/* Clear the incomplete-split flag in left child */
+		if (record->xl_info & XLR_BKP_BLOCK(0))
+			cbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
+		else
+			cbuffer = _bt_clear_incomplete_split(lsn, record,
+												 xlrec->node, cblkno);
 	}
 
 	PageSetLSN(page, lsn);
 	MarkBufferDirty(buffer);
 	UnlockReleaseBuffer(buffer);
+	if (BufferIsValid(cbuffer))
+		UnlockReleaseBuffer(cbuffer);
 
 	_bt_restore_meta(xlrec->node, lsn,
 					 xlrec->rootblk, xlrec->level,
 					 xlrec->rootblk, xlrec->level);
-
-	/* Check to see if this satisfies any incomplete insertions */
-	if (record->xl_len > SizeOfBtreeNewroot)
-		forget_matching_split(xlrec->node, downlink, true);
 }
 
 static void
@@ -1084,55 +1100,19 @@ btree_xlog_cleanup(void)
 	{
 		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
 
-		if (action->is_split)
+		/* finish an incomplete deletion (of a half-dead page) */
+		Buffer		buf;
+
+		buf = XLogReadBuffer(action->node, action->delblk, false);
+		if (BufferIsValid(buf))
 		{
-			/* finish an incomplete split */
-			Buffer		lbuf,
-						rbuf;
-			Page		lpage,
-						rpage;
-			BTPageOpaque lpageop,
-						rpageop;
-			bool		is_only;
 			Relation	reln;
 
-			lbuf = XLogReadBuffer(action->node, action->leftblk, false);
-			/* failure is impossible because we wrote this page earlier */
-			if (!BufferIsValid(lbuf))
-				elog(PANIC, "btree_xlog_cleanup: left block unfound");
-			lpage = (Page) BufferGetPage(lbuf);
-			lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
-			rbuf = XLogReadBuffer(action->node, action->rightblk, false);
-			/* failure is impossible because we wrote this page earlier */
-			if (!BufferIsValid(rbuf))
-				elog(PANIC, "btree_xlog_cleanup: right block unfound");
-			rpage = (Page) BufferGetPage(rbuf);
-			rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
-			/* if the pages are all of their level, it's a only-page split */
-			is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
 			reln = CreateFakeRelcacheEntry(action->node);
-			_bt_insert_parent(reln, lbuf, rbuf, NULL,
-							  action->is_root, is_only);
+			if (_bt_pagedel(reln, buf, NULL) == 0)
+				elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
 			FreeFakeRelcacheEntry(reln);
 		}
-		else
-		{
-			/* finish an incomplete deletion (of a half-dead page) */
-			Buffer		buf;
-
-			buf = XLogReadBuffer(action->node, action->delblk, false);
-			if (BufferIsValid(buf))
-			{
-				Relation	reln;
-
-				reln = CreateFakeRelcacheEntry(action->node);
-				if (_bt_pagedel(reln, buf, NULL) == 0)
-					elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
-				FreeFakeRelcacheEntry(reln);
-			}
-		}
 	}
 	incomplete_actions = NIL;
 }
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 3997f94..803073f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -73,6 +73,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 #define BTP_HALF_DEAD	(1 << 4)	/* empty, but still in tree */
 #define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
 #define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples */
+#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
 
 /*
  * The max allowed value of a cycle ID is a bit less than 64K.	This is
@@ -178,6 +179,7 @@ typedef struct BTMetaPageData
 #define P_ISHALFDEAD(opaque)	((opaque)->btpo_flags & BTP_HALF_DEAD)
 #define P_IGNORE(opaque)		((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
 #define P_HAS_GARBAGE(opaque)	((opaque)->btpo_flags & BTP_HAS_GARBAGE)
+#define P_INCOMPLETE_SPLIT(opaque)	((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
 
 /*
  *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -254,7 +256,7 @@ typedef struct xl_btree_metadata
 typedef struct xl_btree_insert
 {
 	xl_btreetid target;			/* inserted tuple id */
-	/* BlockNumber downlink field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
+	/* BlockNumber finishes_split field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
 	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
 	/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
 } xl_btree_insert;
@@ -287,19 +289,18 @@ typedef struct xl_btree_split
 	OffsetNumber firstright;	/* first item moved to right page */
 
 	/*
-	 * If level > 0, BlockIdData downlink follows.	(We use BlockIdData rather
-	 * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
-	 * aligned.)
+	 * In the _L variants, next are OffsetNumber newitemoff and the new item.
+	 * (In the _R variants, the new item is one of the right page's tuples.)
+	 * The new item, but not newitemoff, is suppressed if XLogInsert chooses
+	 * to store the left page's whole page image.
 	 *
 	 * If level > 0, an IndexTuple representing the HIKEY of the left page
 	 * follows.  We don't need this on leaf pages, because it's the same as
 	 * the leftmost key in the new right page.	Also, it's suppressed if
 	 * XLogInsert chooses to store the left page's whole page image.
 	 *
-	 * In the _L variants, next are OffsetNumber newitemoff and the new item.
-	 * (In the _R variants, the new item is one of the right page's tuples.)
-	 * The new item, but not newitemoff, is suppressed if XLogInsert chooses
-	 * to store the left page's whole page image.
+	 * If level > 0, BlockNumber of the page whose incomplete-split flag
+	 * this insertion clears. (not aligned)
 	 *
 	 * Last are the right page's tuples in the form used by _bt_restore_page.
 	 */
@@ -617,8 +618,7 @@ extern Datum btoptions(PG_FUNCTION_ARGS);
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
-extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
-				  BTStack stack, bool is_root, bool is_only);
+extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
 
 /*
  * prototypes for functions in nbtpage.c
@@ -648,7 +648,8 @@ extern BTStack _bt_search(Relation rel,
 		   int keysz, ScanKey scankey, bool nextkey,
 		   Buffer *bufP, int access);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, int access);
+			  ScanKey scankey, bool nextkey,
+			  bool finishsplits, BTStack stack,  int access);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
 			ScanKey scankey, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to