Hello,

Currently bigger shared_buffers settings don't combine well with
relations being extended frequently, especially if many/most pages have
a high usagecount and/or are dirty and the system is IO constrained.

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is called to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
   block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized
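
In (heavily simplified) code the current path through hio.c/bufmgr.c
looks roughly like the following; error handling, the otherBuffer
locking and the BulkInsertState details are left out:

    if (needLock)                                        /* 1) */
        LockRelationForExtension(relation, ExclusiveLock);

    /*
     * 2) ReadBuffer*(P_NEW); internally bufmgr.c does 3) - 6):
     *    smgrnblocks() determines the target block, BufferAlloc()
     *    searches for a victim buffer, a dirty victim is written out
     *    first, and smgrextend() appends the new, zeroed block.
     */
    buffer = ReadBufferBI(relation, P_NEW, bistate);

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    if (needLock)
        UnlockRelationForExtension(relation, ExclusiveLock);

    page = BufferGetPage(buffer);
    PageInit(page, BufferGetPageSize(buffer), 0);        /* 7) */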


The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers, 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the
kernel's dirty limits, 5) can take a couple of seconds.
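
To illustrate why 4) gets expensive: the victim search boils down to
the clock sweep in freelist.c's StrategyGetBuffer(), roughly like the
below (buffer header locking, the freelist and the bgwriter interaction
all glossed over):

    for (;;)
    {
        buf = GetBufferDescriptor(ClockSweepTick());

        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;     /* spare it, keep sweeping */
            else
                return buf;             /* usable victim found */
        }
    }

With most buffers at a high usagecount that loop has to decrement its
way through (a multiple of) shared_buffers before anything becomes a
victim - and right now all of that happens while holding the extension
lock.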

I've prototyped solving this for heap relations by moving the
smgrnblocks() + smgrextend() calls to RelationGetBufferForTuple(). With
some care (including a retry loop) it's possible to only do those two
steps under the extension lock. That indeed fixes the problems in some
of my tests.

I'm not sure whether the above is the best solution, however. For one, I
think it's not necessarily a good idea to open-code this in hio.c - I've
not observed it, but this probably can happen for btrees and such as
well. For another, this still holds an exclusive lock while we're doing
IO: smgrextend() wants a page to write out, so we have to be careful not
to overwrite things.

There are two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.
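
Just to make that concrete - the name and signature here are invented,
purely to illustrate the shape of such a function:

    /*
     * Extend 'relation' by one block in 'forkNum' and return the new
     * block's buffer, exclusively locked.  Victim buffer search and
     * cleaning happen before the extension lock is taken; only
     * smgrnblocks() + smgrextend() run while holding it.
     */
    extern Buffer ExtendRelationBuffer(Relation relation, ForkNumber forkNum,
                                       BufferAccessStrategy strategy);

hio.c, the btree code and other index AMs could then share that,
instead of each open-coding the sequence above.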

Secondly, I think we could maybe remove the need for an extension lock
altogether. It's primarily required because we're worried that somebody
else can come along, read the page, and initialize it before us. ISTM
that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively, there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it, to avoid creating sparse files.
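
A rough sketch of that sequence; smgrextend_nowrite() is purely made up
here, standing in for an smgrextend()/mdextend() variant that grows the
file without writing any page contents:

    /* grow the file by one block, without writing its contents yet */
    targetBlock = smgrnblocks(relation->rd_smgr, MAIN_FORKNUM);
    smgrextend_nowrite(relation->rd_smgr, MAIN_FORKNUM, targetBlock);

    /* read the new (all-zeroes) block in and lock it exclusively */
    buffer = ReadBuffer(relation, targetBlock);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    /* whoever gets the exclusive lock first initializes the page */
    page = BufferGetPage(buffer);
    if (PageIsNew(page))
        PageInit(page, BufferGetPageSize(buffer), 0);

    /*
     * ... and write it out to the OS soon afterwards, to avoid leaving
     * the file sparse.
     */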

The other reason we need the extension lock is code like
lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.

I've attached my POC for heap extension, but it's really just a POC at
this point.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From f1f38829160d0f2998cd187f2f920cdc7b1fa709 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sun, 29 Mar 2015 20:55:32 +0200
Subject: [PATCH] WIP: Saner heap extension.

---
 src/backend/access/heap/hio.c | 110 ++++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6d091f6..178c417 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -15,6 +15,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
+
 #include "access/heapam.h"
 #include "access/hio.h"
 #include "access/htup_details.h"
@@ -420,63 +422,77 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	/*
 	 * Have to extend the relation.
 	 *
-	 * We have to use a lock to ensure no one else is extending the rel at the
-	 * same time, else we will both try to initialize the same new page.  We
-	 * can skip locking for new or temp relations, however, since no one else
-	 * could be accessing them.
+	 * To avoid, as it used to be the case, holding the extension lock during
+	 * victim buffer search for the new buffer, we extend the relation here
+	 * instead of relying on bufmgr.c. We still have to hold the extension
+	 * lock to prevent a race between two backends initializing the same page.
 	 */
-	needLock = !RELATION_IS_LOCAL(relation);
+	while(true)
+	{
+		char		emptybuf[BLCKSZ];
 
-	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+		/*
+		 * We have to use a lock to ensure no one else is extending the rel at
+		 * the same time, else we will both try to initialize the same new
+		 * page.  We can skip locking for new or temp relations, however,
+		 * since no one else could be accessing them.
+		 */
+		needLock = !RELATION_IS_LOCAL(relation);
+		RelationOpenSmgr(relation);
 
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
+		MemSet((char *) emptybuf, 0, BLCKSZ);
+		PageInit((Page) emptybuf, BLCKSZ, 0);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * Acquire extension lock to avoid two backends extending the
+		 * relation at the same time. This could be avoided by using
+		 * lseek(SEEK_END, +BLKSZ) *without* immediately writing to the
+		 * block. Then read in the page and only initialize after locking
+		 * it. Unclear whether it's a benefit or whether it might be too
+		 * likely to result in sparse files.
+		 */
+		if (needLock)
+			LockRelationForExtension(relation, ExclusiveLock);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		targetBlock = smgrnblocks(relation->rd_smgr, MAIN_FORKNUM);
+		smgrextend(relation->rd_smgr, MAIN_FORKNUM, targetBlock,
+					   emptybuf, false);
 
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+		if (needLock)
+			UnlockRelationForExtension(relation, ExclusiveLock);
 
-	/*
-	 * We need to initialize the empty new page.  Double-check that it really
-	 * is empty (this should never happen, but if it does we don't want to
-	 * risk wiping out valid data).
-	 */
-	page = BufferGetPage(buffer);
+		buffer = ReadBufferBI(relation, targetBlock, bistate);
+
+		/*
+		 * We can be certain that locking the otherBuffer first is OK,
+		 * since it must have a lower page number.
+		 */
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	if (!PageIsNew(page))
-		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-			 BufferGetBlockNumber(buffer),
-			 RelationGetRelationName(relation));
+		page = BufferGetPage(buffer);
 
-	PageInit(page, BufferGetPageSize(buffer), 0);
+		Assert(!PageIsNew(page));
 
-	if (len > PageGetHeapFreeSpace(page))
-	{
-		/* We should not get here given the test at the top */
-		elog(PANIC, "tuple is too big: size %zu", len);
+		/*
+		 * While unlikely, it's possible that another backend managed to use
+		 * up the free space till we got the exclusive lock. That'd require
+		 * the page to be vacuumed (to be put on the free space list) and then
+		 * be used; possible but fairly unlikely in practice. If it happens,
+		 * just retry.
+		 */
+		if (len <= PageGetHeapFreeSpace(page))
+			break;
+
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
+
+		CHECK_FOR_INTERRUPTS();
 	}
 
 	/*
-- 
2.3.0.149.gf3f4077

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to