When profiling replay of WAL generated by pgbench, I noticed that PageRepairFragmentation consumes a large fraction of the CPU time:

Per "perf report":

+   33.44%     6.79%  postmaster  postgres            [.] PageRepairFragmentation

The 33.44% figure includes all the functions called by PageRepairFragmentation. Looking closer at the profile, most of that time seems to be spent sorting the item ids into physical order so that the tuples can be memmoved into place:

+   83.86%     0.00%  postmaster  libc-2.19.so        [.] __libc_start_main
+   83.86%     0.00%  postmaster  postgres            [.] main
+   83.86%     0.00%  postmaster  postgres            [.] PostmasterMain
+   83.86%     0.00%  postmaster  postgres            [.] 0x000000000023208d
+   83.86%     0.00%  postmaster  postgres            [.] AuxiliaryProcessMain
+   83.86%     0.00%  postmaster  postgres            [.] StartupProcessMain
+   83.63%     1.86%  postmaster  postgres            [.] StartupXLOG
+   45.85%     0.10%  postmaster  postgres            [.] heap2_redo
+   33.44%     6.79%  postmaster  postgres            [.] PageRepairFragmentation
+   24.60%    16.63%  postmaster  postgres            [.] pg_qsort
+   18.04%     0.23%  postmaster  postgres            [.] heap_redo
+   17.07%     1.53%  postmaster  postgres            [.] XLogReadBufferExtended
+   16.20%     0.30%  postmaster  postgres            [.] XLogReadBufferForRedoEx
+   14.38%     0.31%  postmaster  postgres            [.] ReadRecord
+   13.90%     1.29%  postmaster  postgres            [.] XLogReadRecord
+   12.40%     1.54%  postmaster  postgres            [.] heap_xlog_update
+   12.08%    12.06%  postmaster  postgres            [.] ValidXLogRecord
+   11.73%     0.10%  postmaster  postgres            [.] XLogReadBufferForRedo
+   10.89%     0.27%  postmaster  postgres            [.] ReadBufferWithoutRelcac
+    8.49%     1.07%  postmaster  libc-2.19.so        [.] __GI___libc_read
+    7.61%     0.71%  postmaster  postgres            [.] ReadBuffer_common
+    5.64%     0.48%  postmaster  postgres            [.] smgropen
+    5.48%     5.47%  postmaster  postgres            [.] itemoffcompare
+    5.40%     5.38%  postmaster  postgres            [.] hash_search_with_hash_v
+    4.70%     4.69%  postmaster  libc-2.19.so        [.] __memmove_ssse3_back
+    4.30%     0.77%  postmaster  libc-2.19.so        [.] __GI___libc_lseek64
+    4.29%     0.20%  postmaster  postgres            [.] heap_xlog_insert
+    3.88%     3.87%  postmaster  postgres            [.] swapfunc
+    2.81%     0.09%  postmaster  postgres            [.] XLogRecordPageWithFreeS
+    2.76%     0.00%          cp  libc-2.19.so        [.] __GI___libc_write
+    2.68%     0.07%  postmaster  postgres            [.] BufTableLookup
+    2.58%     2.58%  postmaster  postgres            [.] LWLockAcquire
+    2.17%     0.14%  postmaster  postgres            [.] tag_hash

So there's clearly some room for improvement here. A couple of ideas:

1. Replace the qsort with something cheaper. The itemid arrays being sorted are small: a few hundred items at most, and usually much smaller. In the pgbench test case I used, the typical size is about 60. With a small array, a plain insertion sort is cheaper than qsort() because it avoids the function call overhead etc. involved in the generic qsort(). Or we could use something smarter, like a radix sort, exploiting the fact that we're sorting small integers. Or we could implement an inlineable version of qsort and use that.

2. Instead of sorting the array and memmoving each tuple into place, we could copy all the tuples to a temporary buffer in arbitrary order, and finally copy the compacted result back to the page. That requires two memory copies per tuple instead of one memmove, but memcpy() is pretty darn fast. It would be a loss when there are only a few large tuples on the page, so that avoiding the sort doesn't help, or when the tuples are mostly already in the correct places, so that most of the memmove()s are no-ops. But with a lot of small tuples it would be a win, and it would be simple. (A rough sketch follows below.)
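To make that concrete, here's an untested sketch of what approach 2 might look like, reusing the itemIdSortData array that PageRepairFragmentation already builds. The function name compactify_tuples_copy and the on-stack scratch buffer are made up for illustration:

static void
compactify_tuples_copy(itemIdSort itemidbase, int nitems, Page page)
{
	PageHeader	phdr = (PageHeader) page;
	char		temp[BLCKSZ];	/* scratch copy of the tuple area */
	Offset		upper = phdr->pd_special;
	int			i;

	/*
	 * No sort needed: walk the items in linp order and copy each tuple
	 * into the scratch buffer at its final offset (first copy).
	 */
	for (i = 0; i < nitems; i++)
	{
		itemIdSort	itemidptr = &itemidbase[i];
		ItemId		lp = PageGetItemId(page, itemidptr->offsetindex + 1);

		upper -= itemidptr->alignedlen;
		memcpy(temp + upper,
			   (char *) page + itemidptr->itemoff,
			   itemidptr->alignedlen);
		lp->lp_off = upper;
	}

	/*
	 * Copy the whole compacted tuple area back onto the page in one go
	 * (second copy). The line pointers live below pd_lower, so they are
	 * not clobbered by this.
	 */
	memcpy((char *) page + upper, temp + upper, phdr->pd_special - upper);
	phdr->pd_upper = upper;
}

The copy-back is a single bulk memcpy() of the contiguous tuple area, so each tuple's bytes are touched twice in total, as described above.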

The second option would change behaviour slightly, as the tuples would end up on the page in a different physical order than before. It wouldn't be visible to users, though.

I spent some time hacking on approach 1, and replaced the qsort() call with a bucket sort. I'm not sure whether a bucket sort is optimal, or better than a specialized quicksort implementation, but it seemed simple.

With the testcase I've been using - replaying about 2GB of WAL generated by pgbench - this reduces the replay time from about 55 s to 45 s.

Thoughts? Attached is the patch I put together. It's actually two patches: the first is just refactoring, moving the code common to PageRepairFragmentation, PageIndexMultiDelete, and PageIndexDeleteNoCompact into a new function. The second replaces the qsort().

- Heikki

From 268e2cce8db21c873f10e62a308ff3b20826ecd7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Tue, 18 Nov 2014 17:05:41 +0200
Subject: [PATCH 1/2] Refactor

---
 src/backend/storage/page/bufpage.c | 130 +++++++++++++++----------------------
 1 file changed, 53 insertions(+), 77 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 2b858c8..ac6e537 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -404,10 +404,9 @@ PageRestoreTempPage(Page tempPage, Page oldPage)
  */
 typedef struct itemIdSortData
 {
-	int			offsetindex;	/* linp array index */
-	int			itemoff;		/* page offset of item data */
-	Size		alignedlen;		/* MAXALIGN(item data len) */
-	ItemIdData	olditemid;		/* used only in PageIndexMultiDelete */
+	uint16		offsetindex;	/* linp array index */
+	int16		itemoff;		/* page offset of item data */
+	uint16		alignedlen;		/* MAXALIGN(item data len) */
 } itemIdSortData;
 typedef itemIdSortData *itemIdSort;
 
@@ -420,6 +419,38 @@ itemoffcompare(const void *itemidp1, const void *itemidp2)
 }
 
 /*
+ * After removing or marking some of the line pointers unused, move the tuples
+ * to remove the gaps caused by the removed items.
+ */
+static void
+compactify_tuples(itemIdSort itemidbase, int nitems, Page page)
+{
+	PageHeader	phdr = (PageHeader) page;
+	Offset		upper;
+	int			i;
+
+	/* sort itemIdSortData array into decreasing itemoff order */
+	qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
+		  itemoffcompare);
+
+	upper = phdr->pd_special;
+	for (i = 0; i < nitems; i++)
+	{
+		itemIdSort	itemidptr = &itemidbase[i];
+		ItemId		lp;
+
+		lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+		upper -= itemidptr->alignedlen;
+		memmove((char *) page + upper,
+				(char *) page + itemidptr->itemoff,
+				itemidptr->alignedlen);
+		lp->lp_off = upper;
+	}
+
+	phdr->pd_upper = upper;
+}
+
+/*
  * PageRepairFragmentation
  *
  * Frees fragmented space on a page.
@@ -441,7 +472,6 @@ PageRepairFragmentation(Page page)
 				nunused;
 	int			i;
 	Size		totallen;
-	Offset		upper;
 
 	/*
 	 * It's worth the trouble to be more paranoid here than in most places,
@@ -515,24 +545,7 @@ PageRepairFragmentation(Page page)
 			   errmsg("corrupted item lengths: total %u, available space %u",
 					  (unsigned int) totallen, pd_special - pd_lower)));
 
-		/* sort itemIdSortData array into decreasing itemoff order */
-		qsort((char *) itemidbase, nstorage, sizeof(itemIdSortData),
-			  itemoffcompare);
-
-		/* compactify page */
-		upper = pd_special;
-
-		for (i = 0, itemidptr = itemidbase; i < nstorage; i++, itemidptr++)
-		{
-			lp = PageGetItemId(page, itemidptr->offsetindex + 1);
-			upper -= itemidptr->alignedlen;
-			memmove((char *) page + upper,
-					(char *) page + itemidptr->itemoff,
-					itemidptr->alignedlen);
-			lp->lp_off = upper;
-		}
-
-		((PageHeader) page)->pd_upper = upper;
+		compactify_tuples(itemidbase, nstorage, page);
 	}
 
 	/* Set hint bit for PageAddItem */
@@ -782,13 +795,12 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 	Offset		pd_upper = phdr->pd_upper;
 	Offset		pd_special = phdr->pd_special;
 	itemIdSortData itemidbase[MaxIndexTuplesPerPage];
+	ItemIdData	newitemids[MaxIndexTuplesPerPage];
 	itemIdSort	itemidptr;
 	ItemId		lp;
 	int			nline,
 				nused;
-	int			i;
 	Size		totallen;
-	Offset		upper;
 	Size		size;
 	unsigned	offset;
 	int			nextitm;
@@ -857,9 +869,9 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 		{
 			itemidptr->offsetindex = nused;		/* where it will go */
 			itemidptr->itemoff = offset;
-			itemidptr->olditemid = *lp;
 			itemidptr->alignedlen = MAXALIGN(size);
 			totallen += itemidptr->alignedlen;
+			newitemids[nused] = *lp;
 			itemidptr++;
 			nused++;
 		}
@@ -875,26 +887,15 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 			   errmsg("corrupted item lengths: total %u, available space %u",
 					  (unsigned int) totallen, pd_special - pd_lower)));
 
-	/* sort itemIdSortData array into decreasing itemoff order */
-	qsort((char *) itemidbase, nused, sizeof(itemIdSortData),
-		  itemoffcompare);
-
-	/* compactify page and install new itemids */
-	upper = pd_special;
-
-	for (i = 0, itemidptr = itemidbase; i < nused; i++, itemidptr++)
-	{
-		lp = PageGetItemId(page, itemidptr->offsetindex + 1);
-		upper -= itemidptr->alignedlen;
-		memmove((char *) page + upper,
-				(char *) page + itemidptr->itemoff,
-				itemidptr->alignedlen);
-		*lp = itemidptr->olditemid;
-		lp->lp_off = upper;
-	}
-
+	/*
+	 * Looks good. Overwrite the line pointers with the copy, from which we've
+	 * removed all the unused items.
+	 */
+	memcpy(phdr->pd_linp, newitemids, nused * sizeof(ItemIdData));
 	phdr->pd_lower = SizeOfPageHeaderData + nused * sizeof(ItemIdData);
-	phdr->pd_upper = upper;
+
+	/* and compactify the tuple data */
+	compactify_tuples(itemidbase, nused, page);
 }
 
 /*
@@ -1000,7 +1001,6 @@ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
 		itemIdSort	itemidptr;
 		int			i;
 		Size		totallen;
-		Offset		upper;
 
 		/*
 		 * Scan the page taking note of each item that we need to preserve.
@@ -1012,7 +1012,8 @@ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
 		 */
 		itemidptr = itemidbase;
 		totallen = 0;
-		for (i = 0; i < nline; i++, itemidptr++)
+		PageClearHasFreeLinePointers(page);
+		for (i = 0; i < nline; i++)
 		{
 			ItemId		lp;
 
@@ -1024,13 +1025,15 @@ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
 				itemidptr->itemoff = ItemIdGetOffset(lp);
 				itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
 				totallen += itemidptr->alignedlen;
+				itemidptr++;
 			}
 			else
 			{
-				itemidptr->itemoff = 0;
-				itemidptr->alignedlen = 0;
+				PageSetHasFreeLinePointers(page);
+				ItemIdSetUnused(lp);
 			}
 		}
+		nline = itemidptr - itemidbase;
 		/* By here, there are exactly nline elements in itemidbase array */
 
 		if (totallen > (Size) (pd_special - pd_lower))
@@ -1039,38 +1042,11 @@ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
 					 errmsg("corrupted item lengths: total %u, available space %u",
 							(unsigned int) totallen, pd_special - pd_lower)));
 
-		/* sort itemIdSortData array into decreasing itemoff order */
-		qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
-			  itemoffcompare);
-
 		/*
 		 * Defragment the data areas of each tuple, being careful to preserve
 		 * each item's position in the linp array.
 		 */
-		upper = pd_special;
-		PageClearHasFreeLinePointers(page);
-		for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
-		{
-			ItemId		lp;
-
-			lp = PageGetItemId(page, itemidptr->offsetindex + 1);
-			if (itemidptr->alignedlen == 0)
-			{
-				PageSetHasFreeLinePointers(page);
-				ItemIdSetUnused(lp);
-				continue;
-			}
-			upper -= itemidptr->alignedlen;
-			memmove((char *) page + upper,
-					(char *) page + itemidptr->itemoff,
-					itemidptr->alignedlen);
-			lp->lp_off = upper;
-			/* lp_flags and lp_len remain the same as originally */
-		}
-
-		/* Set the new page limits */
-		phdr->pd_upper = upper;
-		phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+		compactify_tuples(itemidbase, nline, page);
 	}
 }
 
-- 
2.1.3

From 8ea04a74b34f0f86d9e49a8001fb6e452a3e39cb Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Tue, 18 Nov 2014 19:48:35 +0200
Subject: [PATCH 2/2] Bucket sort

---
 src/backend/storage/page/bufpage.c | 96 +++++++++++++++++++++++++++++++++++---
 1 file changed, 89 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index ac6e537..c30dd55 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -410,12 +410,94 @@ typedef struct itemIdSortData
 } itemIdSortData;
 typedef itemIdSortData *itemIdSort;
 
-static int
-itemoffcompare(const void *itemidp1, const void *itemidp2)
+/*
+ * Sort an array of itemIdSort's on itemoff, descending.
+ *
+ * This performs a simple insertion sort. The arrays are expected to be
+ * very small, so this is faster than the generic qsort(), with all its
+ * function call etc. overhead.
+ */
+static inline void
+sort_itemIds_small(itemIdSort itemidbase, int nitems)
+{
+	int			i,
+				j;
+
+	for (i = 1; i < nitems; i++)
+	{
+		itemIdSortData tmp = itemidbase[i];
+
+		for (j = i; j > 0 && itemidbase[j - 1].itemoff < tmp.itemoff; j--)
+			itemidbase[j] = itemidbase[j - 1];
+		itemidbase[j] = tmp;
+	}
+}
+
+/*
+ * Sort an array of itemIdSort's on itemoff, descending.
+ *
+ * The caller must pass a working array of the same size as the source
+ * array. This function returns a pointer to the sorted array, which may be
+ * either the original array or the working array.
+ */
+static itemIdSort
+sort_itemIds(itemIdSort itemidbase, int nitems, itemIdSort work)
 {
-	/* Sort in decreasing itemoff order */
-	return ((itemIdSort) itemidp2)->itemoff -
-		((itemIdSort) itemidp1)->itemoff;
+#define N_SORT_BUCKETS			32
+#define SORT_BUCKET_SIZE		(BLCKSZ / N_SORT_BUCKETS)
+	uint16		counts[N_SORT_BUCKETS];
+	uint16		positions[N_SORT_BUCKETS];
+	int			i,
+				j;
+	uint16		total;
+
+	/*
+	 * For a small number of entries, just do an insertion sort. Otherwise
+	 * perform a bucket sort.
+	 */
+	if (nitems < 20)
+	{
+		sort_itemIds_small(itemidbase, nitems);
+
+		return itemidbase;
+	}
+
+	/* First count the number of entries that will fall into each bucket */
+	memset(counts, 0, sizeof(counts));
+	for (i = 0; i < nitems; i++)
+	{
+		j = itemidbase[i].itemoff / SORT_BUCKET_SIZE;
+		counts[j]++;
+	}
+
+	/*
+	 * Calculate the bucket boundaries in the destination array.
+	 * After this, positions[i] is the last index + 1 belonging to bucket i.
+	 */
+	total = nitems;
+	for (i = 0; i < N_SORT_BUCKETS; i++)
+	{
+		positions[i] = total;
+		total -= counts[i];
+	}
+
+	/*
+	 * Copy each item to the correct bucket. As a side effect, the positions
+	 * array is modified so that positions[i] becomes the first index belonging
+	 * to bucket i.
+	 */
+	for (i = 0; i < nitems; i++)
+	{
+		j = itemidbase[i].itemoff / SORT_BUCKET_SIZE;
+		positions[j]--;
+		work[positions[j]] = itemidbase[i];
+	}
+
+	/* Finally, sort each bucket, using insertion sort. */
+	for (i = 0; i < N_SORT_BUCKETS; i++)
+		sort_itemIds_small(&work[positions[i]], counts[i]);
+
+	return work;
 }
 
 /*
@@ -428,10 +510,10 @@ compactify_tuples(itemIdSort itemidbase, int nitems, Page page)
 	PageHeader	phdr = (PageHeader) page;
 	Offset		upper;
 	int			i;
+	itemIdSortData work[Max(MaxIndexTuplesPerPage, MaxHeapTuplesPerPage)];
 
 	/* sort itemIdSortData array into decreasing itemoff order */
-	qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
-		  itemoffcompare);
+	itemidbase = sort_itemIds(itemidbase, nitems, work);
 
 	upper = phdr->pd_special;
 	for (i = 0; i < nitems; i++)
-- 
2.1.3
