On 22.10.2013 19:55, Heikki Linnakangas wrote:
I fixed the the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate "the downlink hasn't been inserted
yet". When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.
This is what I came up with.
One thing I'm not totally happy about is the way page deletions of
incompletely split pages are handled. Basically, it just bails out and
refuses to delete a page that is part of an incomplete split. That's
probably OK in practice, as incomplete splits should be very rare
anyway, but it's a bit dissatisfying to not handle the case because at
first glance it seems like it should be even simpler than usual to
delete a page that has no downlink. Nevertheless, I decided to just skip
that for now.
After this patch, deleting the only child of a parent and the parent
itself is still a multi-WAL-record operation that needs to be tracked
during recovery, and completed at the end of recovery. I'd like to
eliminate that too, but that's another patch.
- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..29d8bd1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -384,12 +384,41 @@ an additional insertion above that, etc).
For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course). This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because insertion involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks along the way, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra writes in what would otherwise
+be a read-only operation (it would not be possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is atomically cleared.
+The child page is kept locked until the insertion in the parent is finished
+and the flag in the child cleared, but can be released immediately after
+that, before recursing up the tree, if the parent also needs to be split.
+That ensures that incompletely split pages should not be seen under normal
+circumstances; only when a transaction fails to insert the parent for some
+reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, beacuse it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
+
+We used to keep track of incomplete splits during recovery and finish them
+immediately at end of recovery, instead of doing it lazily at the next
+insertion. However, that made the recovery much more complicated, and only
+fixed the problem when crash recovery was performed. An incomplete split can
+also occur if an otherwise recoverable error, like out-of-memory or
+out-of-disk-space, happens while inserting the downlink to the parent.
When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a452fea..a810015 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -59,14 +59,16 @@ static void _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
- OffsetNumber newitemoff, Size newitemsz,
+static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
+ OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
IndexTuple newitem, bool newitemonleft);
+static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
+ BTStack stack, bool is_root, bool is_only);
static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff,
Size newitemsz,
@@ -130,7 +132,8 @@ top:
* move right in the tree. See Lehman and Yao for an excruciatingly
* precise description.
*/
- buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
+ true, stack, BT_WRITE);
/*
* If we're not allowing duplicates, make sure the key isn't already in
@@ -184,7 +187,7 @@ top:
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
- _bt_insertonpg(rel, buf, stack, itup, offset, false);
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -664,6 +667,10 @@ _bt_findinsertloc(Relation rel,
* insertion, and the buffer must be pinned and write-locked. On return,
* we will have dropped both the pin and the lock on the buffer.
*
+ * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* The locking interactions in this code are critical. You should
* grok Lehman and Yao's paper before making any changes. In addition,
* you need to understand how we disambiguate duplicate keys in this
@@ -677,6 +684,7 @@ _bt_findinsertloc(Relation rel,
static void
_bt_insertonpg(Relation rel,
Buffer buf,
+ Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
@@ -690,6 +698,11 @@ _bt_insertonpg(Relation rel,
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* the caller should've finished any incomplete splits already */
+ if (P_INCOMPLETE_SPLIT(lpageop))
+ elog(ERROR, "cannot insert to incompletely-split page %u",
+ BufferGetBlockNumber(buf));
+
itemsz = IndexTupleDSize(*itup);
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
@@ -714,7 +727,7 @@ _bt_insertonpg(Relation rel,
&newitemonleft);
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, buf, firstright,
+ rbuf = _bt_split(rel, buf, cbuf, firstright,
newitemoff, itemsz, itup, newitemonleft);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
@@ -788,15 +801,25 @@ _bt_insertonpg(Relation rel,
MarkBufferDirty(metabuf);
}
+ /* clear INCOMPLETE_SPLIT flag on child if this finishes a split */
+ if (!P_ISLEAF(lpageop))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_insert xlrec;
- BlockNumber xldownlink;
+ BlockNumber xlleftchild;
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
- XLogRecData rdata[4];
+ XLogRecData rdata[5];
XLogRecData *nextrdata;
IndexTupleData trunctuple;
@@ -812,12 +835,15 @@ _bt_insertonpg(Relation rel,
xlinfo = XLOG_BTREE_INSERT_LEAF;
else
{
- xldownlink = ItemPointerGetBlockNumber(&(itup->t_tid));
- Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
-
- nextrdata->data = (char *) &xldownlink;
+ /*
+ * Include the block number of the left child, whose
+ * INCOMPLETE_SPLIT flag is cleared.
+ */
+ xlleftchild = BufferGetBlockNumber(cbuf);
+ nextrdata->data = (char *) &xlleftchild;
nextrdata->len = sizeof(BlockNumber);
- nextrdata->buffer = InvalidBuffer;
+ nextrdata->buffer = cbuf;
+ nextrdata->buffer_std = true;
nextrdata->next = nextrdata + 1;
nextrdata++;
@@ -870,6 +896,8 @@ _bt_insertonpg(Relation rel,
END_CRIT_SECTION();
/* release buffers; send out relcache inval if metapage changed */
+ if (!P_ISLEAF(lpageop))
+ _bt_relbuf(rel, cbuf);
if (BufferIsValid(metabuf))
{
if (!InRecovery)
@@ -889,11 +917,15 @@ _bt_insertonpg(Relation rel,
* new right page. newitemoff etc. tell us about the new item that
* must be inserted along with the data from the old page.
*
+ * When splitting a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
-_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
bool newitemonleft)
{
@@ -961,6 +993,8 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lopaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_flags = lopaque->btpo_flags;
+ /* set flag in left page indicating that the right page has no downlink */
+ lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
lopaque->btpo_next = rightpagenumber;
ropaque->btpo_prev = origpagenumber;
@@ -1184,6 +1218,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
MarkBufferDirty(sbuf);
}
+ /*
+ * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
+ * a split.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
@@ -1206,34 +1252,6 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata = &rdata[0];
- if (ropaque->btpo.level > 0)
- {
- /* Log downlink on non-leaf pages */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
- lastrdata->len = sizeof(BlockIdData);
- lastrdata->buffer = InvalidBuffer;
-
- /*
- * We must also log the left page's high key, because the right
- * page's leftmost key is suppressed on non-leaf levels. Show it
- * as belonging to the left page buffer, so that it is not stored
- * if XLogInsert decides it needs a full-page image of the left
- * page.
- */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- itemid = PageGetItemId(origpage, P_HIKEY);
- item = (IndexTuple) PageGetItem(origpage, itemid);
- lastrdata->data = (char *) item;
- lastrdata->len = MAXALIGN(IndexTupleSize(item));
- lastrdata->buffer = buf; /* backup block 1 */
- lastrdata->buffer_std = true;
- }
-
/*
* Log the new item and its offset, if it was inserted on the left
* page. (If it was put on the right page, we don't need to explicitly
@@ -1260,17 +1278,40 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->buffer = buf; /* backup block 1 */
lastrdata->buffer_std = true;
}
- else if (ropaque->btpo.level == 0)
+
+ /* Log left page */
+ if (ropaque->btpo.level > 0)
{
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
/*
- * Although we don't need to WAL-log the new item, we still need
- * XLogInsert to consider storing a full-page image of the left
- * page, so make an empty entry referencing that buffer. This also
- * ensures that the left page is always backup block 1.
+ * We must also log the left page's high key, because the right
+ * page's leftmost key is suppressed on non-leaf levels. Show it
+ * as belonging to the left page buffer, so that it is not stored
+ * if XLogInsert decides it needs a full-page image of the left
+ * page.
*/
+ itemid = PageGetItemId(origpage, P_HIKEY);
+ item = (IndexTuple) PageGetItem(origpage, itemid);
+ lastrdata->data = (char *) item;
+ lastrdata->len = MAXALIGN(IndexTupleSize(item));
+ lastrdata->buffer = buf; /* backup block 1 */
+ lastrdata->buffer_std = true;
+ }
+
+ if (ropaque->btpo.level == 0 && !newitemonleft)
+ {
lastrdata->next = lastrdata + 1;
lastrdata++;
+ /*
+ * Although we don't need to WAL-log anything on the left page,
+ * the new item, we still need XLogInsert to consider storing a
+ * full-page image of the left page, so make an empty entry
+ * referencing that buffer. This also ensures that the left page
+ * is always backup block 1.
+ */
lastrdata->data = NULL;
lastrdata->len = 0;
lastrdata->buffer = buf; /* backup block 1 */
@@ -1278,6 +1319,22 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
}
/*
+ * Log block number of left child, whose INCOMPLETE_SPLIT flag this
+ * insertion clears.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ BlockNumber cblkno = BufferGetBlockNumber(cbuf);
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
+ lastrdata->data = (char *) &cblkno;
+ lastrdata->len = sizeof(BlockNumber);
+ lastrdata->buffer = cbuf; /* backup block 2 */
+ lastrdata->buffer_std = true;
+ }
+
+ /*
* Log the contents of the right page in the format understood by
* _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
* because we're going to recreate the whole page anyway, so it should
@@ -1306,7 +1363,7 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->data = NULL;
lastrdata->len = 0;
- lastrdata->buffer = sbuf; /* backup block 2 */
+ lastrdata->buffer = sbuf; /* bkp block 2 (leaf) or 3 (non-leaf) */
lastrdata->buffer_std = true;
}
@@ -1333,6 +1390,10 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
if (!P_RIGHTMOST(ropaque))
_bt_relbuf(rel, sbuf);
+ /* release the child */
+ if (ropaque->btpo.level > 0)
+ _bt_relbuf(rel, cbuf);
+
/* split's done */
return rbuf;
}
@@ -1603,10 +1664,8 @@ _bt_checksplitloc(FindSplitData *state,
* have to be efficient (concurrent ROOT split, WAL recovery)
* is_root - we split the true root
* is_only - we split a page alone on its level (might have been fast root)
- *
- * This is exported so it can be called by nbtxlog.c.
*/
-void
+static void
_bt_insert_parent(Relation rel,
Buffer buf,
Buffer rbuf,
@@ -1685,12 +1744,13 @@ _bt_insert_parent(Relation rel,
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
-
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- /* Now we can unlock the children */
+ /*
+ * Now we can unlock the right child. The left child will be unlocked
+ * by _bt_insertonpg().
+ */
_bt_relbuf(rel, rbuf);
- _bt_relbuf(rel, buf);
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
@@ -1698,7 +1758,7 @@ _bt_insert_parent(Relation rel,
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
- _bt_insertonpg(rel, pbuf, stack->bts_parent,
+ _bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
is_only);
@@ -1708,6 +1768,56 @@ _bt_insert_parent(Relation rel,
}
/*
+ * _bt_finish_split() -- Finish an incomplete split
+ *
+ * 'lbuf' is the left buffer whose right sibling is missing the downlink.
+ * On entry, we hold write-lock on it, and the lock is released on exit.
+ */
+void
+_bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
+{
+ Page lpage = BufferGetPage(lbuf);
+ BTPageOpaque lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+ Buffer rbuf;
+ Page rpage;
+ BTPageOpaque rpageop;
+ bool was_root;
+ bool was_only;
+
+ /* Lock the right sibling, the one missing the downlink */
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rpage = BufferGetPage(rbuf);
+ rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+ /* Could this be a root split? */
+ if (!stack)
+ {
+ Buffer metabuf;
+ Page metapg;
+ BTMetaPageData *metad;
+
+ /* acquire lock on the metapage */
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
+
+ was_root = (metad->btm_root == BufferGetBlockNumber(lbuf));
+
+ _bt_relbuf(rel, metabuf);
+ }
+ else
+ was_root = false;
+
+ /* Was this the only page on the level before split? */
+ was_only = (P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop));
+
+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
+
+ _bt_insert_parent(rel, lbuf, rbuf, stack, was_root, was_only);
+}
+
+/*
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item
* we last looked at in the parent.
*
@@ -1843,6 +1953,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rbkno;
BlockNumber rootblknum;
BTPageOpaque rootopaque;
+ BTPageOpaque lopaque;
ItemId itemid;
IndexTuple item;
Size itemsz;
@@ -1854,6 +1965,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
lbkno = BufferGetBlockNumber(lbuf);
rbkno = BufferGetBlockNumber(rbuf);
lpage = BufferGetPage(lbuf);
+ lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1927,6 +2039,11 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
pfree(new_item);
+ /* Clear the incomplete-split flag in the left child */
+ Assert(P_INCOMPLETE_SPLIT(lopaque));
+ lopaque->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(lbuf);
+
MarkBufferDirty(rootbuf);
MarkBufferDirty(metabuf);
@@ -1935,7 +2052,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
{
xl_btree_newroot xlrec;
XLogRecPtr recptr;
- XLogRecData rdata[2];
+ XLogRecData rdata[3];
xlrec.node = rel->rd_node;
xlrec.rootblk = rootblknum;
@@ -1954,7 +2071,13 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rdata[1].len = ((PageHeader) rootpage)->pd_special -
((PageHeader) rootpage)->pd_upper;
rdata[1].buffer = InvalidBuffer;
- rdata[1].next = NULL;
+ rdata[1].next = &(rdata[2]);
+
+ /* Make a full-page image of the left child if needed */
+ rdata[2].data = NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = lbuf;
+ rdata[2].next = NULL;
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f407753..cbbfc1d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -992,8 +992,15 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_READ);
if (pbuf == InvalidBuffer)
- elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
+ {
+ /*
+ * If the target page's left sibling was incompletely split, it
+ * has no downlink in the parent. Bail out.
+ */
+ elog(DEBUG1, "failed to re-find parent key in index \"%s\" for deletion target page %u",
RelationGetRelationName(rel), target);
+ return false;
+ }
parent = stack->bts_blkno;
poffset = stack->bts_offset;
@@ -1094,11 +1101,16 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
/*
* We can never delete rightmost pages nor root pages. While at it, check
* that page is not already deleted and is empty.
+ *
+ * Also don't delete a page that was incompletely split. We could handle
+ * that case, but it would complicate the logic, and incomplete splits are
+ * rare.
*/
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
- P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+ P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+ P_INCOMPLETE_SPLIT(opaque))
{
/* Should never fail to delete a half-dead page */
Assert(!P_ISHALFDEAD(opaque));
@@ -1223,6 +1235,19 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
+
+ /*
+ * If the left sibling was incompletely split, and therefore the target
+ * page has no downlink in the parent, give up. We could handle this
+ * case - in fact deleting a page that has no downlink would probably
+ * be simpler than the general case, as the parent doesn't need to be
+ * updated. But incomplete splits are rare.
+ */
+ if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_relbuf(rel, lbuf);
+ return 0;
+ }
}
else
lbuf = InvalidBuffer;
@@ -1242,7 +1267,8 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
* didn't have it locked; the others are just for paranoia's sake.
*/
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
- P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+ P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+ P_INCOMPLETE_SPLIT(opaque))
{
_bt_relbuf(rel, buf);
if (BufferIsValid(lbuf))
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ac98589..a85b5ee 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -51,7 +51,9 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
- * will result in *bufP being set to InvalidBuffer.
+ * will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
+ * any incomplete splits encountered during the search will be finished by
+ * inserting the missing downlinks.
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
@@ -82,8 +84,13 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage). If it has, we may need to
* move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way.
*/
- *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey, BT_READ);
+ *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+ (access == BT_WRITE), stack_in,
+ BT_READ);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -148,6 +155,11 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* item >= scankey. When nextkey is true, we are looking for the first
* item strictly greater than scankey.
*
+ * If finishsplits is true, we will attempt to finish any incompletely split
+ * pages that we encounter. This is required when searching for a target page
+ * for an insertion, because the split algorithm will fail if it cannot find
+ * the downlink of a page. 'stack' is only used if finishsplits is true.
+ *
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
* the same on the right sibling. Return value is the buffer we stop at.
@@ -158,11 +170,14 @@ _bt_moveright(Relation rel,
int keysz,
ScanKey scankey,
bool nextkey,
+ bool finishsplits,
+ BTStack stack,
int access)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ int curaccess = access;
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -186,12 +201,41 @@ _bt_moveright(Relation rel,
while (!P_RIGHTMOST(opaque) &&
(P_IGNORE(opaque) ||
- _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
+ _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval ||
+ (finishsplits && P_INCOMPLETE_SPLIT(opaque))))
{
- /* step right one page */
- BlockNumber rblkno = opaque->btpo_next;
+ BlockNumber blkno;
+ int needaccess = access;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (finishsplits && P_INCOMPLETE_SPLIT(opaque))
+ {
+ blkno = BufferGetBlockNumber(buf);
+ if (curaccess == BT_READ)
+ {
+ /* re-acquire in write-mode, and recheck */
+ needaccess = BT_WRITE;
+ }
+ else
+ {
+ _bt_finish_split(rel, buf, stack);
+ /*
+ * the buffer was released by _bt_finish_split, we will
+ * re-acquire it in the right mode below.
+ */
+ buf = InvalidBuffer;
+ }
+ }
+ else
+ {
+ /* step right one page */
+ blkno = opaque->btpo_next;
+ }
- buf = _bt_relandgetbuf(rel, buf, rblkno, access);
+ buf = _bt_relandgetbuf(rel, buf, blkno, needaccess);
+ curaccess = needaccess;
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index cb5867e..c934a5e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,77 +21,29 @@
#include "miscadmin.h"
/*
- * We must keep track of expected insertions due to page splits, and apply
- * them manually if they are not seen in the WAL log during replay. This
- * makes it safe for page insertion to be a multiple-WAL-action process.
- *
- * Similarly, deletion of an only child page and deletion of its parent page
+ * Deletion of an only child page and deletion of its parent page
* form multiple WAL log entries, and we have to be prepared to follow through
- * with the deletion if the log ends between.
+ * with the deletion if the log ends between. Keep track of those actions,
*
* The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split or multi deletion to remain incomplete
+ * since we don't expect a multi deletion to remain incomplete
* for long. In any case we need to respect the order of operations.
*/
typedef struct bt_incomplete_action
{
RelFileNode node; /* the index */
- bool is_split; /* T = pending split, F = pending delete */
- /* these fields are for a split: */
- bool is_root; /* we split the root */
- BlockNumber leftblk; /* left half of split */
- BlockNumber rightblk; /* right half of split */
/* these fields are for a delete: */
BlockNumber delblk; /* parent block to be deleted */
} bt_incomplete_action;
static List *incomplete_actions;
-
-static void
-log_incomplete_split(RelFileNode node, BlockNumber leftblk,
- BlockNumber rightblk, bool is_root)
-{
- bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
-
- action->node = node;
- action->is_split = true;
- action->is_root = is_root;
- action->leftblk = leftblk;
- action->rightblk = rightblk;
- incomplete_actions = lappend(incomplete_actions, action);
-}
-
-static void
-forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
-{
- ListCell *l;
-
- foreach(l, incomplete_actions)
- {
- bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
- if (RelFileNodeEquals(node, action->node) &&
- action->is_split &&
- downlink == action->rightblk)
- {
- if (is_root != action->is_root)
- elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
- action->is_root, is_root);
- incomplete_actions = list_delete_ptr(incomplete_actions, action);
- pfree(action);
- break; /* need not look further */
- }
- }
-}
-
static void
log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
{
bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
action->node = node;
- action->is_split = false;
action->delblk = delblk;
incomplete_actions = lappend(incomplete_actions, action);
}
@@ -106,7 +58,6 @@ forget_matching_deletion(RelFileNode node, BlockNumber delblk)
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
if (RelFileNodeEquals(node, action->node) &&
- !action->is_split &&
delblk == action->delblk)
{
incomplete_actions = list_delete_ptr(incomplete_actions, action);
@@ -190,23 +141,60 @@ _bt_restore_meta(RelFileNode rnode, XLogRecPtr lsn,
UnlockReleaseBuffer(metabuf);
}
+/*
+ * _bt_clear_incomplete_split -- clear INCOMPLETE_SPLIT flag on a page
+ *
+ * This is a common subroutine of the redo functions of all the WAL record
+ * types that can insert a downlink: insert, split, and newroot.
+ *
+ * Returns a (locked) buffer containing the child page, or InvalidBuffer if
+ * the page was not found (because it was truncated later).
+ */
+static Buffer
+_bt_clear_incomplete_split(XLogRecPtr lsn, XLogRecord *record,
+ RelFileNode rnode, BlockNumber cblock)
+{
+ Buffer cbuf;
+
+ cbuf = XLogReadBuffer(rnode, cblock, false);
+ if (BufferIsValid(cbuf))
+ {
+ Page page = (Page) BufferGetPage(cbuf);
+
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+ pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(cbuf);
+ }
+ }
+
+ return cbuf;
+}
+
static void
btree_xlog_insert(bool isleaf, bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
Buffer buffer;
+ Buffer cbuffer = InvalidBuffer;
Page page;
char *datapos;
int datalen;
xl_btree_metadata md;
- BlockNumber downlink = 0;
+ BlockNumber finishes_split = 0;
+ int blk_index;
datapos = (char *) xlrec + SizeOfBtreeInsert;
datalen = record->xl_len - SizeOfBtreeInsert;
- if (!isleaf)
+ if (!isleaf && (record->xl_info & XLR_BKP_BLOCK(0)) == 0)
{
- memcpy(&downlink, datapos, sizeof(BlockNumber));
+ memcpy(&finishes_split, datapos, sizeof(BlockNumber));
+ Assert(finishes_split != 0);
datapos += sizeof(BlockNumber);
datalen -= sizeof(BlockNumber);
}
@@ -217,8 +205,18 @@ btree_xlog_insert(bool isleaf, bool ismeta,
datalen -= sizeof(xl_btree_metadata);
}
- if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ cbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
+ else
+ cbuffer = _bt_clear_incomplete_split(lsn, record, xlrec->target.node, finishes_split);
+ }
+
+ /* the backup block index containing the main page is 0 or 1 */
+ blk_index = isleaf ? 0 : 1;
+ if (record->xl_info & XLR_BKP_BLOCK(blk_index))
+ (void) RestoreBackupBlock(lsn, record, blk_index, false, false);
else
{
buffer = XLogReadBuffer(xlrec->target.node,
@@ -228,11 +226,7 @@ btree_xlog_insert(bool isleaf, bool ismeta,
{
page = (Page) BufferGetPage(buffer);
- if (lsn <= PageGetLSN(page))
- {
- UnlockReleaseBuffer(buffer);
- }
- else
+ if (lsn > PageGetLSN(page))
{
if (PageAddItem(page, (Item) datapos, datalen,
ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
@@ -241,11 +235,14 @@ btree_xlog_insert(bool isleaf, bool ismeta,
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
+ if (BufferIsValid(cbuffer))
+ UnlockReleaseBuffer(cbuffer);
+
/*
* Note: in normal operation, we'd update the metapage while still holding
* lock on the page we inserted into. But during replay it's not
@@ -257,10 +254,6 @@ btree_xlog_insert(bool isleaf, bool ismeta,
_bt_restore_meta(xlrec->target.node, lsn,
md.root, md.level,
md.fastroot, md.fastlevel);
-
- /* Forget any split this insertion completes */
- if (!isleaf)
- forget_matching_split(xlrec->target.node, downlink, false);
}
static void
@@ -268,7 +261,10 @@ btree_xlog_split(bool onleft, bool isroot,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
+ bool isleaf = xlrec->level == 0;
+ Buffer lbuf;
Buffer rbuf;
+ Buffer cbuf = InvalidBuffer;
Page rpage;
BTPageOpaque ropaque;
char *datapos;
@@ -278,42 +274,19 @@ btree_xlog_split(bool onleft, bool isroot,
Size newitemsz = 0;
Item left_hikey = NULL;
Size left_hikeysz = 0;
+ BlockNumber finishes_split = InvalidBlockNumber;
datapos = (char *) xlrec + SizeOfBtreeSplit;
datalen = record->xl_len - SizeOfBtreeSplit;
- /* Forget any split this insertion completes */
- if (xlrec->level > 0)
- {
- /* we assume SizeOfBtreeSplit is at least 16-bit aligned */
- BlockNumber downlink = BlockIdGetBlockNumber((BlockId) datapos);
-
- datapos += sizeof(BlockIdData);
- datalen -= sizeof(BlockIdData);
-
- forget_matching_split(xlrec->node, downlink, false);
-
- /* Extract left hikey and its size (still assuming 16-bit alignment) */
- if (!(record->xl_info & XLR_BKP_BLOCK(0)))
- {
- /* We assume 16-bit alignment is enough for IndexTupleSize */
- left_hikey = (Item) datapos;
- left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-
- datapos += left_hikeysz;
- datalen -= left_hikeysz;
- }
- }
-
- /* Extract newitem and newitemoff, if present */
+ /* Extract newitemoff and newitem, if present */
if (onleft)
{
- /* Extract the offset (still assuming 16-bit alignment) */
+ /* we assume SizeOfBtreeSplit is at least 16-bit aligned */
memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
datapos += sizeof(OffsetNumber);
datalen -= sizeof(OffsetNumber);
}
-
if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
{
/*
@@ -327,6 +300,37 @@ btree_xlog_split(bool onleft, bool isroot,
datalen -= newitemsz;
}
+ /* Extract left hikey and its size (still assuming 16-bit alignment) */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(0)))
+ {
+ left_hikey = (Item) datapos;
+ left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+ datapos += left_hikeysz;
+ datalen -= left_hikeysz;
+ }
+ /*
+ * If this insertion finishes an incomplete split, get the block number
+ * of the child.
+ */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(1)))
+ {
+ memcpy(&finishes_split, datapos, sizeof(BlockNumber));
+ datapos += sizeof(BlockNumber);
+ datalen -= sizeof(BlockNumber);
+ }
+
+ /*
+ * Clear the incomplete split flag on the left sibling of the child page
+ * this is a downlink for.
+ */
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ cbuf = RestoreBackupBlock(lsn, record, 2, false, true);
+ else
+ cbuf = _bt_clear_incomplete_split(lsn, record, xlrec->node, finishes_split);
+ }
+
/* Reconstruct right (new) sibling page from scratch */
rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
Assert(BufferIsValid(rbuf));
@@ -338,7 +342,7 @@ btree_xlog_split(bool onleft, bool isroot,
ropaque->btpo_prev = xlrec->leftsib;
ropaque->btpo_next = xlrec->rnext;
ropaque->btpo.level = xlrec->level;
- ropaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
_bt_restore_page(rpage, datapos, datalen);
@@ -347,7 +351,7 @@ btree_xlog_split(bool onleft, bool isroot,
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
- if (xlrec->level == 0)
+ if (isleaf)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
@@ -362,10 +366,10 @@ btree_xlog_split(bool onleft, bool isroot,
/* Now reconstruct left (original) sibling page */
if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ lbuf = RestoreBackupBlock(lsn, record, 0, false, true);
else
{
- Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
+ lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
@@ -422,19 +426,22 @@ btree_xlog_split(bool onleft, bool isroot,
elog(PANIC, "failed to add high key to left page after split");
/* Fix opaque fields */
- lopaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ lopaque->btpo_flags = isleaf ? BTP_LEAF : 0;
+ lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_next = xlrec->rightsib;
lopaque->btpo_cycleid = 0;
PageSetLSN(lpage, lsn);
MarkBufferDirty(lbuf);
}
-
- UnlockReleaseBuffer(lbuf);
}
}
- /* We no longer need the right buffer */
+ /* We no longer need the buffers */
+ if (BufferIsValid(cbuf))
+ UnlockReleaseBuffer(cbuf);
+ if (BufferIsValid(lbuf))
+ UnlockReleaseBuffer(lbuf);
UnlockReleaseBuffer(rbuf);
/*
@@ -445,32 +452,39 @@ btree_xlog_split(bool onleft, bool isroot,
* replay, because no other index update can be in progress, and readers
* will cope properly when following an obsolete left-link.
*/
- if (record->xl_info & XLR_BKP_BLOCK(1))
- (void) RestoreBackupBlock(lsn, record, 1, false, false);
- else if (xlrec->rnext != P_NONE)
+ if (xlrec->rnext != P_NONE)
{
- Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+ /*
+ * the backup block containing right sibling is 2 or 3, depending
+ * whether this was a leaf or internal page.
+ */
+ int rnext_index = isleaf ? 2 : 3;
- if (BufferIsValid(buffer))
+ if (record->xl_info & XLR_BKP_BLOCK(rnext_index))
+ (void) RestoreBackupBlock(lsn, record, rnext_index, false, false);
+ else
{
- Page page = (Page) BufferGetPage(buffer);
+ Buffer buffer;
- if (lsn > PageGetLSN(page))
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+
+ if (BufferIsValid(buffer))
{
- BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Page page = (Page) BufferGetPage(buffer);
- pageop->btpo_prev = xlrec->rightsib;
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
- PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
+ pageop->btpo_prev = xlrec->rightsib;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
}
- UnlockReleaseBuffer(buffer);
}
}
-
- /* The job ain't done till the parent link is inserted... */
- log_incomplete_split(xlrec->node,
- xlrec->leftsib, xlrec->rightsib, isroot);
}
static void
@@ -946,12 +960,11 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_newroot *xlrec = (xl_btree_newroot *) XLogRecGetData(record);
Buffer buffer;
+ Buffer cbuffer = InvalidBuffer;
Page page;
BTPageOpaque pageop;
- BlockNumber downlink = 0;
-
- /* Backup blocks are not used in newroot records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+ BlockNumber cblkno;
+ IndexTuple itup;
buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
Assert(BufferIsValid(buffer));
@@ -969,28 +982,31 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
if (record->xl_len > SizeOfBtreeNewroot)
{
- IndexTuple itup;
-
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeNewroot,
record->xl_len - SizeOfBtreeNewroot);
- /* extract downlink to the right-hand split page */
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
- downlink = ItemPointerGetBlockNumber(&(itup->t_tid));
+ /* extract block number of the left-hand split page */
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ cblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+
+ /* Clear the incomplete-split flag in left child */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ cbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
+ else
+ cbuffer = _bt_clear_incomplete_split(lsn, record,
+ xlrec->node, cblkno);
}
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);
+ if (BufferIsValid(cbuffer))
+ UnlockReleaseBuffer(cbuffer);
_bt_restore_meta(xlrec->node, lsn,
xlrec->rootblk, xlrec->level,
xlrec->rootblk, xlrec->level);
-
- /* Check to see if this satisfies any incomplete insertions */
- if (record->xl_len > SizeOfBtreeNewroot)
- forget_matching_split(xlrec->node, downlink, true);
}
static void
@@ -1084,55 +1100,19 @@ btree_xlog_cleanup(void)
{
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
- if (action->is_split)
+ /* finish an incomplete deletion (of a half-dead page) */
+ Buffer buf;
+
+ buf = XLogReadBuffer(action->node, action->delblk, false);
+ if (BufferIsValid(buf))
{
- /* finish an incomplete split */
- Buffer lbuf,
- rbuf;
- Page lpage,
- rpage;
- BTPageOpaque lpageop,
- rpageop;
- bool is_only;
Relation reln;
- lbuf = XLogReadBuffer(action->node, action->leftblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(lbuf))
- elog(PANIC, "btree_xlog_cleanup: left block unfound");
- lpage = (Page) BufferGetPage(lbuf);
- lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
- rbuf = XLogReadBuffer(action->node, action->rightblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(rbuf))
- elog(PANIC, "btree_xlog_cleanup: right block unfound");
- rpage = (Page) BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
- /* if the pages are all of their level, it's a only-page split */
- is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
reln = CreateFakeRelcacheEntry(action->node);
- _bt_insert_parent(reln, lbuf, rbuf, NULL,
- action->is_root, is_only);
+ if (_bt_pagedel(reln, buf, NULL) == 0)
+ elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
FreeFakeRelcacheEntry(reln);
}
- else
- {
- /* finish an incomplete deletion (of a half-dead page) */
- Buffer buf;
-
- buf = XLogReadBuffer(action->node, action->delblk, false);
- if (BufferIsValid(buf))
- {
- Relation reln;
-
- reln = CreateFakeRelcacheEntry(action->node);
- if (_bt_pagedel(reln, buf, NULL) == 0)
- elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
- FreeFakeRelcacheEntry(reln);
- }
- }
}
incomplete_actions = NIL;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 3997f94..803073f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -73,6 +73,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
+#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -178,6 +179,7 @@ typedef struct BTMetaPageData
#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
+#define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -254,7 +256,7 @@ typedef struct xl_btree_metadata
typedef struct xl_btree_insert
{
xl_btreetid target; /* inserted tuple id */
- /* BlockNumber downlink field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
+ /* BlockNumber finishes_split field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
} xl_btree_insert;
@@ -287,19 +289,18 @@ typedef struct xl_btree_split
OffsetNumber firstright; /* first item moved to right page */
/*
- * If level > 0, BlockIdData downlink follows. (We use BlockIdData rather
- * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
- * aligned.)
+ * In the _L variants, next are OffsetNumber newitemoff and the new item.
+ * (In the _R variants, the new item is one of the right page's tuples.)
+ * The new item, but not newitemoff, is suppressed if XLogInsert chooses
+ * to store the left page's whole page image.
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in the new right page. Also, it's suppressed if
* XLogInsert chooses to store the left page's whole page image.
*
- * In the _L variants, next are OffsetNumber newitemoff and the new item.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * The new item, but not newitemoff, is suppressed if XLogInsert chooses
- * to store the left page's whole page image.
+ * If level > 0, BlockNumber of the page whose incomplete-split flag
+ * this insertion clears. (not aligned)
*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
@@ -617,8 +618,7 @@ extern Datum btoptions(PG_FUNCTION_ARGS);
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
-extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
- BTStack stack, bool is_root, bool is_only);
+extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
/*
* prototypes for functions in nbtpage.c
@@ -648,7 +648,8 @@ extern BTStack _bt_search(Relation rel,
int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey, int access);
+ ScanKey scankey, bool nextkey,
+ bool finishsplits, BTStack stack, int access);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers