While looking at Alexander's GIN patch, I noticed an ancient bug in the
WAL-logging of GIN entry-tree insertions. entryPlaceToPage and
dataPlacetoPage functions don't make a full-page image of the page, when
inserting a downlink on a non-leaf page. The comment says:
/*
* Prevent full page write if child's split occurs. That is needed to
* remove incomplete splits while replaying WAL
*
* data.updateBlkno contains new block number (of newly created right
* page) for recently splited page.
*/
The code is doing what the comment says, but that's wrong. You can't
just skip the full page write, it's needed for torn page protection like
in any other case. The correct fix would've been to change the
redo-routine to do the incomplete split tracking for the page even if
it's restored from a full page image.
This was broken by this commit back in 2007:
commit 853d1c3103fa961ae6219f0281885b345593d101
Author: Teodor Sigaev <[email protected]>
Date: Mon Jun 4 15:56:28 2007 +0000
Fix bundle bugs of GIN:
- Fix possible deadlock between UPDATE and VACUUM queries. Bug never was
observed in 8.2, but it still exist there. HEAD is more sensitive to
bug after recent "ring" of buffer improvements.
- Fix WAL creation: if parent page is stored as is after split then
incomplete split isn't removed during replay. This happens rather rare,
only
on large tables with a lot of updates/inserts.
- Fix WAL replay: there was wrong test of XLR_BKP_BLOCK_* for left
page after deletion of page. That causes wrong rightlink field: it pointed
to deleted page.
- add checking of match of clearing incomplete split
- cleanup incomplete split list after proceeding
All of this chages doesn't change on-disk storage, so backpatch...
But second point may be an issue for replaying logs from previous version.
The relevant part is the "Fix WAL creation" item. I searched the
archives but couldn't find any discussion leading to this fix.
In 2010, Tom actually fixed the redo-routine in commit
4016bdef8aded77b4903c457050622a5a1815c16, along with other fixes. So all
we need to do now is to fix that bogus logic in entryPlaceToPage to not
skip the full-page-image.
Attached is a patch for 9.3. I've whacked the code a lot in master, but
the same bug is present there too.
- Heikki
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 13ab448..32eba4c 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -381,7 +381,6 @@ dataPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prda
{
Page page = BufferGetPage(buf);
int sizeofitem = GinSizeOfDataPageItem(page);
- int cnt = 0;
/* these must be static so they can be returned to caller */
static XLogRecData rdata[3];
@@ -401,32 +400,25 @@ dataPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prda
data.isLeaf = GinPageIsLeaf(page) ? TRUE : FALSE;
/*
- * Prevent full page write if child's split occurs. That is needed to
- * remove incomplete splits while replaying WAL
- *
- * data.updateBlkno contains new block number (of newly created right
- * page) for recently splited page.
+ * For incomplete-split tracking, we need the updateBlkno information and
+ * the inserted item even when we make a full page image of the page, so
+ * put the buffer reference in a separate XLogRecData entry.
*/
- if (data.updateBlkno == InvalidBlockNumber)
- {
- rdata[0].buffer = buf;
- rdata[0].buffer_std = FALSE;
- rdata[0].data = NULL;
- rdata[0].len = 0;
- rdata[0].next = &rdata[1];
- cnt++;
- }
+ rdata[0].buffer = buf;
+ rdata[0].buffer_std = FALSE;
+ rdata[0].data = NULL;
+ rdata[0].len = 0;
+ rdata[0].next = &rdata[1];
- rdata[cnt].buffer = InvalidBuffer;
- rdata[cnt].data = (char *) &data;
- rdata[cnt].len = sizeof(ginxlogInsert);
- rdata[cnt].next = &rdata[cnt + 1];
- cnt++;
+ rdata[1].buffer = InvalidBuffer;
+ rdata[1].data = (char *) &data;
+ rdata[1].len = sizeof(ginxlogInsert);
+ rdata[1].next = &rdata[2];
- rdata[cnt].buffer = InvalidBuffer;
- rdata[cnt].data = (GinPageIsLeaf(page)) ? ((char *) (btree->items + btree->curitem)) : ((char *) &(btree->pitem));
- rdata[cnt].len = sizeofitem;
- rdata[cnt].next = NULL;
+ rdata[2].buffer = InvalidBuffer;
+ rdata[2].data = (GinPageIsLeaf(page)) ? ((char *) (btree->items + btree->curitem)) : ((char *) &(btree->pitem));
+ rdata[2].len = sizeofitem;
+ rdata[2].next = NULL;
if (GinPageIsLeaf(page))
{
@@ -442,7 +434,7 @@ dataPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prda
btree->curitem++;
}
data.nitem = btree->curitem - savedPos;
- rdata[cnt].len = sizeofitem * data.nitem;
+ rdata[2].len = sizeofitem * data.nitem;
}
else
{
diff --git a/src/backend/access/gin/ginentrypage.c b/src/backend/access/gin/ginentrypage.c
index 8ead38f..f6c0632 100644
--- a/src/backend/access/gin/ginentrypage.c
+++ b/src/backend/access/gin/ginentrypage.c
@@ -486,7 +486,6 @@ entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prd
{
Page page = BufferGetPage(buf);
OffsetNumber placed;
- int cnt = 0;
/* these must be static so they can be returned to caller */
static XLogRecData rdata[3];
@@ -509,32 +508,25 @@ entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prd
data.isLeaf = GinPageIsLeaf(page) ? TRUE : FALSE;
/*
- * Prevent full page write if child's split occurs. That is needed to
- * remove incomplete splits while replaying WAL
- *
- * data.updateBlkno contains new block number (of newly created right
- * page) for recently splited page.
+ * For incomplete-split tracking, we need the updateBlkno information and
+ * the inserted item even when we make a full page image of the page, so
+ * put the buffer reference in a separate XLogRecData entry.
*/
- if (data.updateBlkno == InvalidBlockNumber)
- {
- rdata[0].buffer = buf;
- rdata[0].buffer_std = TRUE;
- rdata[0].data = NULL;
- rdata[0].len = 0;
- rdata[0].next = &rdata[1];
- cnt++;
- }
-
- rdata[cnt].buffer = InvalidBuffer;
- rdata[cnt].data = (char *) &data;
- rdata[cnt].len = sizeof(ginxlogInsert);
- rdata[cnt].next = &rdata[cnt + 1];
- cnt++;
+ rdata[0].buffer = buf;
+ rdata[0].buffer_std = TRUE;
+ rdata[0].data = NULL;
+ rdata[0].len = 0;
+ rdata[0].next = &rdata[1];
- rdata[cnt].buffer = InvalidBuffer;
- rdata[cnt].data = (char *) btree->entry;
- rdata[cnt].len = IndexTupleSize(btree->entry);
- rdata[cnt].next = NULL;
+ rdata[1].buffer = InvalidBuffer;
+ rdata[1].data = (char *) &data;
+ rdata[1].len = sizeof(ginxlogInsert);
+ rdata[1].next = &rdata[2];
+
+ rdata[2].buffer = InvalidBuffer;
+ rdata[2].data = (char *) btree->entry;
+ rdata[2].len = IndexTupleSize(btree->entry);
+ rdata[2].next = NULL;
btree->entry = NULL;
}
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers