Hi,

In an effort to speed up bulk data loading/transforming, I noticed that considerable time is spent on the relation extension lock. I know there are already many other efforts to increase the chances of using bulk loading [1], [2], [3], [4], efforts to make loading more parallel [5], and to remove some syscalls [6], as well as completely new I/O systems [7] and some more extreme measures like disabling WAL logging [8].

Whilst these will all help, they will ultimately be limited by the relation extension lock. Also, from the tests I've done so far it seems that, at least for bulk loading, pwrite() can actually carry us rather far as long as we are not doing it under a global lock. Moreover, the solution provided here might be an alternative to [7], because the results are quite promising even with WAL enabled.

Attached are two WIP patches, in the hope of getting feedback.
The first changes the way we do bulk loading: each backend now gets its own standalone set of allocated blocks, local to that backend. This helps both with reducing contention and with some OS writeback mechanisms. The second patch reduces the time spent locking the buffer partitions: it shifts the partition mapping around so that each set of 128 blocks uses the same buffer partition, and adds a custom function to get buffer blocks specifically for extension while keeping the previous partition lock, thereby reducing the amount of time we spend on futexes.

The design:
- Add a set of buffers to the BulkInsertState that can be used by the backend without any locks.
- Add a ReadBufferExtendBulk function which extends the relation by BULK_INSERT_BATCH_SIZE blocks at once.
- Rework FileWrite to take a parameter that speeds up relation extension by indicating that the write is only there to extend the file. If supported, this uses ftruncate, as that is much faster (also faster than posix_fallocate on my system) and, according to the man page (https://linux.die.net/man/2/ftruncate), the extended part should read back as zeroed space (see the sketch after this list). This is to be cleaned up later, possibly into a dedicated FileExtend() function.
- Rework mdextend to take a page count.
- Add a specialized version of BufferAlloc, called BufferAllocExtend, which keeps the lock on the last buffer partition around and tries to reuse it, so that there are far fewer futex calls.
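
To illustrate that FileWrite/ftruncate point outside of PostgreSQL: below is a minimal standalone sketch (not the patched FileWrite itself; the file name, block size, and block count are made up for the example) comparing the two ways of growing a file by zero-filled blocks, an explicit pwrite() of a zeroed buffer versus a single ftruncate() call that just moves EOF.

/*
 * Standalone sketch: grow a file by zero-filled blocks, first with
 * pwrite() of a zeroed buffer (what mdextend does today), then by
 * simply moving EOF with ftruncate(); POSIX guarantees the extended
 * part reads back as zeroes.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define DEMO_BLCKSZ   8192
#define DEMO_NBLOCKS  128

int
main(void)
{
	char   *zeroes = calloc(DEMO_NBLOCKS, DEMO_BLCKSZ);
	int		fd = open("extend_demo.dat", O_CREAT | O_RDWR | O_TRUNC, 0600);
	off_t	end;

	if (fd < 0 || zeroes == NULL)
	{
		perror("setup");
		return 1;
	}
	end = lseek(fd, 0, SEEK_END);

	/* Variant 1: write DEMO_NBLOCKS blocks of zeroes explicitly. */
	if (pwrite(fd, zeroes, DEMO_NBLOCKS * DEMO_BLCKSZ, end) !=
		DEMO_NBLOCKS * DEMO_BLCKSZ)
		perror("pwrite");

	/* Variant 2: grow by another DEMO_NBLOCKS blocks by moving EOF only. */
	if (ftruncate(fd, end + 2 * (off_t) DEMO_NBLOCKS * DEMO_BLCKSZ) != 0)
		perror("ftruncate");

	close(fd);
	free(zeroes);
	return 0;
}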

Many things are still to be done; some of them:
- Reintroduce the FSM, and possibly optimize the lock usage there. In other words: this patch currently can only start the server and run COPY FROM and read queries.
- Look into the WAL logging. Whilst it seems to be fairly optimal already wrt the locking and such, I noticed there seems to be an alternating pattern between the bgwriter and the workers, although setting some parameters bigger helped a lot (wal_writer_flush_after, wal_writer_delay, wal_buffers).
- Work nicely with the code from [6] so that register_dirty_segment is indeed not needed anymore; or alternatively optimize that code so that fewer locks are needed.
- Make use of [9] in the fallback case in FileWrite(), when ftruncate/fallocate is not available, so that the buffer size can be reduced.

First results are below; all tests loaded the same 1 GB lineitem CSV 32 times into the same table. Tests were done on NVMe, and the more parallel ones also on a tmpfs disk, to spot potential disk bottlenecks and, e.g., to gauge the potential of using NVDIMM.
=================================
Using an unlogged table:
NVME, UNLOGGED table, 4 parallel streams:   HEAD 171s, patched 113s
NVME, UNLOGGED table, 8 parallel streams:   HEAD 113s, patched 67s
NVME, UNLOGGED table, 16 parallel streams:  HEAD 112s, patched 42s
NVME, UNLOGGED table, 32 parallel streams:  HEAD 121s, patched 36s
tmpfs, UNLOGGED table, 16 parallel streams: HEAD 96s, patched 38s
tmpfs, UNLOGGED table, 32 parallel streams: HEAD 104s, patched 25s
=================================
Using a normal table, wal_level=minimal, 16 MB WAL segments:
NVME, 4 parallel streams:   HEAD 237s, patched 157s
NVME, 8 parallel streams:   HEAD 200s, patched 142s
NVME, 16 parallel streams:  HEAD 171s, patched 145s
NVME, 32 parallel streams:  HEAD 199s, patched 146s
tmpfs, 16 parallel streams: HEAD 131s, patched 89s
tmpfs, 32 parallel streams: HEAD 148s, patched 98s
=================================
Using a normal table, wal_level=minimal, 256 MB WAL segments,
wal_init_zero = off, wal_buffers = 262143, wal_writer_delay = 10000ms,
wal_writer_flush_after = 512MB:

NVME, 4 parallel streams:   HEAD 201s, patched 159s
NVME, 8 parallel streams:   HEAD 157s, patched 109s
NVME, 16 parallel streams:  HEAD 150s, patched 78s
NVME, 32 parallel streams:  HEAD 158s, patched 70s
tmpfs, 16 parallel streams: HEAD 106s, patched 54s
tmpfs, 32 parallel streams: HEAD 113s, patched 44s
=================================

Thoughts?

Cheers,
Luc
Swarm64


[1] https://www.postgresql.org/message-id/flat/CAJcOf-f%3DUX1uKbPjDXf%2B8gJOoEPz9VCzh7pKnknfU4sG4LXj0A%40mail.gmail.com#49fe9f2ffcc9916cc5ed3a712aa5f28f
[2] https://www.postgresql.org/message-id/flat/CALj2ACWjymmyTvvhU20Er-LPLaBjzBQxMJwr4nzO7XWmOkxhsg%40mail.gmail.com#be34b5b7861876fc0fd7edb621c067fa
[3] https://www.postgresql.org/message-id/flat/CALj2ACXg-4hNKJC6nFnepRHYT4t5jJVstYvri%2BtKQHy7ydcr8A%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40mail.gmail.com
[5] https://www.postgresql.org/message-id/flat/CALDaNm3GaZyYPpGu-PpF0SEkJg-eaW3TboHxpxJ-2criv2j_eA%40mail.gmail.com#07292ce654ef58fae7f257a4e36afc41
[6] https://www.postgresql.org/message-id/flat/20200203132319.x7my43whtefeznz7%40alap3.anarazel.de#85a2a0ab915cdf079862d70505abe3db
[7] https://www.postgresql.org/message-id/flat/20201208040227.7rlzpfpdxoau4pvd%40alap3.anarazel.de#b8ea4a3b7f37e88ddfe121c4b3075e7b
[8] https://www.postgresql.org/message-id/flat/CAD21AoA9oK1VOoUuKW-jEO%3DY2nt5kCQKKFgeQwwRUMoh6BE-ow%40mail.gmail.com#0475248a5ff7aed735be41fd4034ae36
[9] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHf_R_ih1pkBMTWn%3DSTyKMOM2Ks47Y_UqqfU1wRc1VvA%40mail.gmail.com#7a53ad72331e423ba3c6a50e6dc1259f
From b08074c5141967a8e6e805fd9c15c1a0450af2f7 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <l...@swarm64.com>
Date: Thu, 31 Dec 2020 09:11:54 +0100
Subject: [PATCH v1 1/2] WIP: local bulk allocation

---
 src/backend/access/heap/heapam.c      |  12 ++
 src/backend/access/heap/hio.c         | 178 ++++++--------------------
 src/backend/access/heap/rewriteheap.c |   2 +-
 src/backend/storage/buffer/bufmgr.c   |  60 ++++++++-
 src/backend/storage/file/buffile.c    |   3 +-
 src/backend/storage/file/fd.c         |  16 ++-
 src/backend/storage/smgr/md.c         |  56 ++++----
 src/backend/storage/smgr/smgr.c       |  15 ++-
 src/include/access/hio.h              |   4 +
 src/include/storage/bufmgr.h          |   2 +
 src/include/storage/fd.h              |   2 +-
 src/include/storage/md.h              |   2 +-
 src/include/storage/smgr.h            |   2 +
 13 files changed, 183 insertions(+), 171 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 26c2006f23..0a923c1ffd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1810,6 +1810,10 @@ GetBulkInsertState(void)
 	bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
 	bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
 	bistate->current_buf = InvalidBuffer;
+	bistate->local_buffers_idx = BULK_INSERT_BATCH_SIZE;
+	for (int i=0; i<BULK_INSERT_BATCH_SIZE; ++i)
+		bistate->local_buffers[i] = InvalidBuffer;
+	bistate->empty_buffer = palloc0(BULK_INSERT_BATCH_SIZE * BLCKSZ);
 	return bistate;
 }
 
@@ -1821,7 +1825,15 @@ FreeBulkInsertState(BulkInsertState bistate)
 {
 	if (bistate->current_buf != InvalidBuffer)
 		ReleaseBuffer(bistate->current_buf);
+	for (int i=bistate->local_buffers_idx; i<BULK_INSERT_BATCH_SIZE; ++i)
+		if (bistate->local_buffers[i] != InvalidBuffer)
+		{
+			// FSM?
+			//LockBuffer(bistate->local_buffers[i], BUFFER_LOCK_UNLOCK);
+			ReleaseBuffer(bistate->local_buffers[i]);
+		}
 	FreeAccessStrategy(bistate->strategy);
+	pfree(bistate->empty_buffer);
 	pfree(bistate);
 }
 
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index ca357410a2..9111badd2f 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -24,6 +24,7 @@
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 
+#include "miscadmin.h"
 
 /*
  * RelationPutHeapTuple - place tuple at specified page
@@ -118,9 +119,19 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
 		bistate->current_buf = InvalidBuffer;
 	}
 
-	/* Perform a read using the buffer strategy */
-	buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
-								mode, bistate->strategy);
+	if (targetBlock == P_NEW && mode == RBM_ZERO_AND_LOCK && bistate->local_buffers_idx < BULK_INSERT_BATCH_SIZE)
+	{
+		/* If we have a local buffer remaining, use that */
+		buffer = bistate->local_buffers[bistate->local_buffers_idx++];
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		Assert(buffer != InvalidBuffer);
+	}
+	else
+	{
+		/* Perform a read using the buffer strategy */
+		buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
+									mode, bistate->strategy);
+	}
 
 	/* Save the selected block as target for future inserts */
 	IncrBufferRefCount(buffer);
@@ -187,90 +198,6 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 	}
 }
 
-/*
- * Extend a relation by multiple blocks to avoid future contention on the
- * relation extension lock.  Our goal is to pre-extend the relation by an
- * amount which ramps up as the degree of contention ramps up, but limiting
- * the result to some sane overall value.
- */
-static void
-RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
-{
-	BlockNumber blockNum,
-				firstBlock = InvalidBlockNumber;
-	int			extraBlocks;
-	int			lockWaiters;
-
-	/* Use the length of the lock wait queue to judge how much to extend. */
-	lockWaiters = RelationExtensionLockWaiterCount(relation);
-	if (lockWaiters <= 0)
-		return;
-
-	/*
-	 * It might seem like multiplying the number of lock waiters by as much as
-	 * 20 is too aggressive, but benchmarking revealed that smaller numbers
-	 * were insufficient.  512 is just an arbitrary cap to prevent
-	 * pathological results.
-	 */
-	extraBlocks = Min(512, lockWaiters * 20);
-
-	do
-	{
-		Buffer		buffer;
-		Page		page;
-		Size		freespace;
-
-		/*
-		 * Extend by one page.  This should generally match the main-line
-		 * extension code in RelationGetBufferForTuple, except that we hold
-		 * the relation extension lock throughout, and we don't immediately
-		 * initialize the page (see below).
-		 */
-		buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
-		page = BufferGetPage(buffer);
-
-		if (!PageIsNew(page))
-			elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-				 BufferGetBlockNumber(buffer),
-				 RelationGetRelationName(relation));
-
-		/*
-		 * Add the page to the FSM without initializing. If we were to
-		 * initialize here, the page would potentially get flushed out to disk
-		 * before we add any useful content. There's no guarantee that that'd
-		 * happen before a potential crash, so we need to deal with
-		 * uninitialized pages anyway, thus avoid the potential for
-		 * unnecessary writes.
-		 */
-
-		/* we'll need this info below */
-		blockNum = BufferGetBlockNumber(buffer);
-		freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
-
-		UnlockReleaseBuffer(buffer);
-
-		/* Remember first block number thus added. */
-		if (firstBlock == InvalidBlockNumber)
-			firstBlock = blockNum;
-
-		/*
-		 * Immediately update the bottom level of the FSM.  This has a good
-		 * chance of making this page visible to other concurrently inserting
-		 * backends, and we want that to happen without delay.
-		 */
-		RecordPageWithFreeSpace(relation, blockNum, freespace);
-	}
-	while (--extraBlocks > 0);
-
-	/*
-	 * Updating the upper levels of the free space map is too expensive to do
-	 * for every block, but it's worth doing once at the end to make sure that
-	 * subsequent insertion activity sees all of those nifty free pages we
-	 * just inserted.
-	 */
-	FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
-}
-
 /*
  * RelationGetBufferForTuple
  *
@@ -333,14 +260,14 @@ RelationGetBufferForTuple(Relation relation, Size len,
 						  BulkInsertState bistate,
 						  Buffer *vmbuffer, Buffer *vmbuffer_other)
 {
-	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
+	bool		use_fsm = false; //!(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
 	Size		pageFreeSpace = 0,
 				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
-	bool		needLock;
+	bool		needLock = false;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -396,19 +323,6 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		 * target.
 		 */
 		targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
-
-		/*
-		 * If the FSM knows nothing of the rel, try the last page before we
-		 * give up and extend.  This avoids one-tuple-per-page syndrome during
-		 * bootstrapping or in a recently-started system.
-		 */
-		if (targetBlock == InvalidBlockNumber)
-		{
-			BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-
-			if (nblocks > 0)
-				targetBlock = nblocks - 1;
-		}
 	}
 
 loop:
@@ -541,37 +455,36 @@ loop:
 		 * Update FSM as to condition of this page, and ask for another page
 		 * to try.
 		 */
-		targetBlock = RecordAndGetPageWithFreeSpace(relation,
+		targetBlock = InvalidBlockNumber; 
+		/*RecordAndGetPageWithFreeSpace(relation,
 													targetBlock,
 													pageFreeSpace,
-													len + saveFreeSpace);
+													len + saveFreeSpace);*/
 	}
 
-	/*
-	 * Have to extend the relation.
-	 *
-	 * We have to use a lock to ensure no one else is extending the rel at the
-	 * same time, else we will both try to initialize the same new page.  We
-	 * can skip locking for new or temp relations, however, since no one else
-	 * could be accessing them.
-	 */
-	needLock = !RELATION_IS_LOCAL(relation);
 
-	/*
-	 * If we need the lock but are not able to acquire it immediately, we'll
-	 * consider extending the relation by multiple blocks at a time to manage
-	 * contention on the relation extension lock.  However, this only makes
-	 * sense if we're using the FSM; otherwise, there's no point.
-	 */
-	if (needLock)
+	if (bistate)
 	{
-		if (!use_fsm)
+		if (bistate->local_buffers_idx == BULK_INSERT_BATCH_SIZE)
+			/* Ran out of local buffers for the bulk state, allocate some more */
+			ReadBufferExtendBulk(relation, bistate);
+	}
+	else
+	{
+		/*
+		 * Have to extend the relation.
+		 *
+		 * We have to use a lock to ensure no one else is extending the rel at the
+		 * same time, else we will both try to initialize the same new page.  We
+		 * can skip locking for new or temp relations, however, since no one else
+		 * could be accessing them.
+		 */
+		needLock = !RELATION_IS_LOCAL(relation);
+
+		if (needLock)
 			LockRelationForExtension(relation, ExclusiveLock);
-		else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+		if (use_fsm)
 		{
-			/* Couldn't get the lock immediately; wait for it. */
-			LockRelationForExtension(relation, ExclusiveLock);
-
 			/*
 			 * Check if some other backend has extended a block for us while
 			 * we were waiting on the lock.
@@ -587,21 +500,9 @@ loop:
 				UnlockRelationForExtension(relation, ExclusiveLock);
 				goto loop;
 			}
-
-			/* Time to bulk-extend. */
-			RelationAddExtraBlocks(relation, bistate);
 		}
 	}
 
-	/*
-	 * In addition to whatever extension we performed above, we always add at
-	 * least one block to satisfy our own request.
-	 *
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
 	buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
 
 	/*
@@ -612,7 +513,8 @@ loop:
 	page = BufferGetPage(buffer);
 
 	if (!PageIsNew(page))
-		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
+		elog(ERROR, "%d page %u of relation \"%s\" should be empty but is not",
+			 MyProcPid,
 			 BufferGetBlockNumber(buffer),
 			 RelationGetRelationName(relation));
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 65942cc428..e1f2ac8faa 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -912,7 +912,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		 * check the above "Logical rewrite support" comment for reasoning.
 		 */
 		written = FileWrite(src->vfd, waldata_start, len, src->off,
-							WAIT_EVENT_LOGICAL_REWRITE_WRITE);
+							WAIT_EVENT_LOGICAL_REWRITE_WRITE, false);
 		if (written != len)
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c5e8707151..d1ce88fee5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/tableam.h"
+#include "access/hio.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
@@ -47,6 +48,7 @@
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/proc.h"
+#include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "storage/standby.h"
 #include "utils/memdebug.h"
@@ -824,8 +826,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
 		if (!PageIsNew((Page) bufBlock))
 			ereport(ERROR,
-					(errmsg("unexpected data beyond EOF in block %u of relation %s",
-							blockNum, relpath(smgr->smgr_rnode, forkNum)),
+					(errmsg("%d unexpected data beyond EOF in block %u of relation %s",
+							MyProcPid, blockNum, relpath(smgr->smgr_rnode, forkNum)),
 					 errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
 
 		/*
@@ -985,6 +987,60 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
+void
+ReadBufferExtendBulk(Relation reln, struct BulkInsertStateData* bistate)
+{
+	char relpersistence = reln->rd_rel->relpersistence;
+	SMgrRelation smgr;
+	BufferDesc *bufHdr;
+	BlockNumber	firstBlock,
+				blockNum;
+	Block		bufBlock;
+	bool		found;
+
+	/* Open it at the smgr level if not already done */
+	RelationOpenSmgr(reln);
+	smgr = reln->rd_smgr;
+
+	if (SmgrIsTemp(smgr))
+		elog(ERROR, "Bulk extend for temporary tables not supported");
+
+	LockRelationForExtension(reln, ExclusiveLock);
+	firstBlock = smgrnblocks(smgr, MAIN_FORKNUM);
+	smgrextend_count(smgr, MAIN_FORKNUM, firstBlock, bistate->empty_buffer, false, BULK_INSERT_BATCH_SIZE);
+	UnlockRelationForExtension(reln, ExclusiveLock);
+
+	for (int i=0; i<BULK_INSERT_BATCH_SIZE; ++i)
+	{
+		blockNum = firstBlock + i;
+
+		pgstat_count_buffer_read(reln);
+		/* Make sure we will have room to remember the buffer pin */
+		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+		bufHdr = BufferAlloc(smgr, relpersistence, MAIN_FORKNUM, blockNum,
+							 bistate->strategy, &found);
+		pgBufferUsage.shared_blks_written++;
+
+		Assert(!found);
+
+		bufBlock = BufHdrGetBlock(bufHdr);
+
+		/* new buffers are zero-filled */
+		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		/* Set BM_VALID, terminate IO, and wake up any waiters */
+		TerminateBufferIO(bufHdr, false, BM_VALID);
+		bistate->local_buffers[i] = BufferDescriptorGetBuffer(bufHdr);
+	}
+
+	bistate->local_buffers_idx = 0;
+
+	VacuumPageMiss += BULK_INSERT_BATCH_SIZE;
+	if (VacuumCostActive)
+		VacuumCostBalance += VacuumCostPageMiss * BULK_INSERT_BATCH_SIZE;
+}
+
+
 /*
  * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
  *		buffer.  If no buffer exists already, selects a replacement
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index d581f96eda..5055674090 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -499,7 +499,8 @@ BufFileDumpBuffer(BufFile *file)
 								 file->buffer.data + wpos,
 								 bytestowrite,
 								 file->curOffset,
-								 WAIT_EVENT_BUFFILE_WRITE);
+								 WAIT_EVENT_BUFFILE_WRITE,
+								 false);
 		if (bytestowrite <= 0)
 			ereport(ERROR,
 					(errcode_for_file_access(),
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f07b5325aa..0f11bc98c4 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2058,7 +2058,7 @@ retry:
 
 int
 FileWrite(File file, char *buffer, int amount, off_t offset,
-		  uint32 wait_event_info)
+		  uint32 wait_event_info, bool is_empty)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2104,7 +2104,21 @@ FileWrite(File file, char *buffer, int amount, off_t offset,
 retry:
 	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
+#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
+	if (is_empty)
+	{
+		do
+		{
+			returnCode = ftruncate(VfdCache[file].fd, offset + amount);
+			//returnCode = fallocate(VfdCache[file].fd, 0, offset, amount);
+		} while (unlikely(returnCode == EINTR && !(ProcDiePending || QueryCancelPending)));
+		returnCode = amount;
+	}
+	else
+		returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+#else
 	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+#endif
 	pgstat_report_wait_end();
 
 	/* if write didn't set errno, assume problem is no disk space */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 9889ad6ad8..741da34656 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -412,10 +412,13 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
  */
 void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 char *buffer, bool skipFsync)
+		 char *buffer, bool skipFsync, int count)
 {
 	off_t		seekpos;
+	BlockNumber block_in_seg;
+	BlockNumber count_in_seg;
 	int			nbytes;
+	int			write_size;
 	MdfdVec    *v;
 
 	/* This assert is too expensive to have on normally ... */
@@ -435,31 +438,40 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						relpath(reln->smgr_rnode, forknum),
 						InvalidBlockNumber)));
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
+	do {
+		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		block_in_seg = (blocknum % ((BlockNumber) RELSEG_SIZE));
+		count_in_seg = Min((RELSEG_SIZE - block_in_seg), count);
+		seekpos = (off_t) BLCKSZ * block_in_seg;
+		write_size = BLCKSZ * count_in_seg;
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
-	{
-		if (nbytes < 0)
+		if ((nbytes = FileWrite(v->mdfd_vfd, buffer, write_size, seekpos, WAIT_EVENT_DATA_FILE_EXTEND, true)) != write_size)
+		{
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not extend file \"%s\": %m",
+								FilePathName(v->mdfd_vfd)),
+						 errhint("Check free disk space.")));
+			/* short write: complain appropriately */
 			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not extend file \"%s\": %m",
-							FilePathName(v->mdfd_vfd)),
+					(errcode(ERRCODE_DISK_FULL),
+					 errmsg("could not extend file \"%s\": wrote only %d of %d bytes at block %u",
+							FilePathName(v->mdfd_vfd),
+							nbytes, write_size, blocknum),
 					 errhint("Check free disk space.")));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not extend file \"%s\": wrote only %d of %d bytes at block %u",
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ, blocknum),
-				 errhint("Check free disk space.")));
-	}
+		}
 
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+		if (!skipFsync && !SmgrIsTemp(reln))
+			register_dirty_segment(reln, forknum, v);
+
+		count -= count_in_seg;
+		buffer += write_size;
+		blocknum += count_in_seg;
+	} while (count > 0);
 
 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }
@@ -719,7 +731,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 
 	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE, false);
 
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
 										reln->smgr_rnode.node.spcNode,
@@ -1254,7 +1266,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 				mdextend(reln, forknum,
 						 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
-						 zerobuf, skipFsync);
+						 zerobuf, skipFsync, 1);
 				pfree(zerobuf);
 			}
 			flags = O_CREAT;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 072bdd118f..1ad6f5d7de 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -49,7 +49,7 @@ typedef struct f_smgr
 	void		(*smgr_unlink) (RelFileNodeBackend rnode, ForkNumber forknum,
 								bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-								BlockNumber blocknum, char *buffer, bool skipFsync);
+								BlockNumber blocknum, char *buffer, bool skipFsync, int count);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
@@ -461,17 +461,24 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
 void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 		   char *buffer, bool skipFsync)
+{
+	return smgrextend_count(reln, forknum, blocknum, buffer, skipFsync, 1);
+}
+
+void
+smgrextend_count(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   char *buffer, bool skipFsync, int count)
 {
 	smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
-										 buffer, skipFsync);
+										 buffer, skipFsync, count);
 
 	/*
-	 * Normally we expect this to increase nblocks by one, but if the cached
+	 * Normally we expect this to increase nblocks by count, but if the cached
 	 * value isn't as expected, just invalidate it so the next call asks the
 	 * kernel.
 	 */
 	if (reln->smgr_cached_nblocks[forknum] == blocknum)
-		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
+		reln->smgr_cached_nblocks[forknum] = blocknum + count;
 	else
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 }
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index f69a92521b..1d7b2181cb 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -26,10 +26,14 @@
  *
  * "typedef struct BulkInsertStateData *BulkInsertState" is in heapam.h
  */
+#define BULK_INSERT_BATCH_SIZE 128
 typedef struct BulkInsertStateData
 {
 	BufferAccessStrategy strategy;	/* our BULKWRITE strategy object */
 	Buffer		current_buf;	/* current insertion target page */
+	int 		local_buffers_idx;
+	Buffer 		local_buffers[BULK_INSERT_BATCH_SIZE];
+	char* 		empty_buffer;
 } BulkInsertStateData;
 
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..8a37dd3ec0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -60,6 +60,7 @@ struct WritebackContext;
 
 /* forward declared, to avoid including smgr.h here */
 struct SMgrRelationData;
+struct BulkInsertStateData;
 
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
@@ -180,6 +181,7 @@ extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
 								 BufferAccessStrategy strategy);
+void ReadBufferExtendBulk(Relation reln, struct BulkInsertStateData* bistate);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 4e1cc12e23..cba2139de3 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -82,7 +82,7 @@ extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
 extern int	FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
-extern int	FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
+extern int	FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info, bool is_empty);
 extern int	FileSync(File file, uint32 wait_event_info);
 extern off_t FileSize(File file);
 extern int	FileTruncate(File file, off_t offset, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 07fd1bb7d0..7df8847737 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -27,7 +27,7 @@ extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, char *buffer, bool skipFsync);
+					 BlockNumber blocknum, char *buffer, bool skipFsync, int count);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..5a6c619cf6 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -90,6 +90,8 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void smgrextend_count(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum, char *buffer, bool skipFsync, int count);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
-- 
2.25.1

From ef9867757d4ed35831ad7c3808bf36b56ebf4034 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <l...@swarm64.com>
Date: Thu, 31 Dec 2020 13:20:13 +0100
Subject: [PATCH v1 2/2] WIP: buffer alloc specialized for relation extension

---
 src/backend/storage/buffer/bufmgr.c | 280 +++++++++++++++++++++++++++-
 src/include/storage/buf_internals.h |   2 +-
 2 files changed, 275 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d1ce88fee5..14f0923bb5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -474,6 +474,10 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 							   BlockNumber blockNum,
 							   BufferAccessStrategy strategy,
 							   bool *foundPtr);
+static BufferDesc *BufferAllocExtend(SMgrRelation smgr, 
+			char relpersistence, ForkNumber forkNum,
+			BlockNumber blockNum, BufferAccessStrategy strategy,
+			LWLock** lastPartitionLock);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
@@ -996,7 +1000,7 @@ ReadBufferExtendBulk(Relation reln, struct BulkInsertStateData* bistate)
 	BlockNumber	firstBlock,
 				blockNum;
 	Block		bufBlock;
-	bool		found;
+	LWLock*		lastPartitionLock = NULL;
 
 	/* Open it at the smgr level if not already done */
 	RelationOpenSmgr(reln);
@@ -1007,7 +1011,7 @@ ReadBufferExtendBulk(Relation reln, struct BulkInsertStateData* bistate)
 
 	LockRelationForExtension(reln, ExclusiveLock);
 	firstBlock = smgrnblocks(smgr, MAIN_FORKNUM);
-	smgrextend_count(smgr, MAIN_FORKNUM, firstBlock, bistate->empty_buffer, false, BULK_INSERT_BATCH_SIZE);
+	smgrextend_count(smgr, MAIN_FORKNUM, firstBlock, bistate->empty_buffer, true, BULK_INSERT_BATCH_SIZE);
 	UnlockRelationForExtension(reln, ExclusiveLock);
 
 	for (int i=0; i<BULK_INSERT_BATCH_SIZE; ++i)
@@ -1017,12 +1021,10 @@ ReadBufferExtendBulk(Relation reln, struct BulkInsertStateData* bistate)
 		pgstat_count_buffer_read(reln);
 		/* Make sure we will have room to remember the buffer pin */
 		ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
-		bufHdr = BufferAlloc(smgr, relpersistence, MAIN_FORKNUM, blockNum,
-							 bistate->strategy, &found);
+		bufHdr = BufferAllocExtend(smgr, relpersistence, MAIN_FORKNUM, blockNum,
+							 bistate->strategy, &lastPartitionLock);
 		pgBufferUsage.shared_blks_written++;
 
-		Assert(!found);
-
 		bufBlock = BufHdrGetBlock(bufHdr);
 
 		/* new buffers are zero-filled */
@@ -1033,6 +1035,9 @@ ReadBufferExtendBulk(Relation reln, struct BulkInsertStateData* bistate)
 		bistate->local_buffers[i] = BufferDescriptorGetBuffer(bufHdr);
 	}
 
+	Assert(lastPartitionLock);
+	LWLockRelease(lastPartitionLock);
+
 	bistate->local_buffers_idx = 0;
 
 	VacuumPageMiss += BULK_INSERT_BATCH_SIZE;
@@ -1408,6 +1413,269 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	return buf;
 }
 
+static BufferDesc *
+BufferAllocExtend(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+			BlockNumber blockNum, BufferAccessStrategy strategy,
+			LWLock** lastPartitionLock)
+{
+	BufferTag	newTag;			/* identity of requested block */
+	uint32		newHash;		/* hash value for newTag */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	BufferTag	oldTag;			/* previous identity of selected buffer */
+	uint32		oldHash;		/* hash value for oldTag */
+	LWLock	   *oldPartitionLock;	/* buffer partition lock for it */
+	uint32		oldFlags;
+	int			buf_id;
+	BufferDesc *buf;
+	uint32		buf_state;
+
+	Assert(lastPartitionLock);
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	newHash = BufTableHashCode(&newTag);
+	newPartitionLock = BufMappingPartitionLock(newHash);
+
+	/* Loop here in case we have to try another victim buffer */
+	for (;;)
+	{
+		/*
+		 * Ensure, while the spinlock's not yet held, that there's a free
+		 * refcount entry.
+		 */
+		ReservePrivateRefCountEntry();
+
+		/*
+		 * Select a victim buffer.  The buffer is returned with its header
+		 * spinlock still held!
+		 */
+		buf = StrategyGetBuffer(strategy, &buf_state);
+
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Must copy buffer flags while we still hold the spinlock */
+		oldFlags = buf_state & BUF_FLAG_MASK;
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf);
+
+		/*
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released it
+		 * above, or even while we are writing it out (since our share-lock
+		 * won't prevent hint-bit updates).  We will recheck the dirty bit
+		 * after re-locking the buffer header.
+		 */
+		if (oldFlags & BM_DIRTY)
+		{
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+										 LW_SHARED))
+			{
+				/*
+				 * If using a nondefault strategy, and writing the buffer
+				 * would require a WAL flush, let the strategy decide whether
+				 * to go ahead and write/reuse the buffer or to choose another
+				 * victim.  We need lock to inspect the page LSN, so this
+				 * can't be done inside StrategyGetBuffer.
+				 */
+				if (strategy != NULL)
+				{
+					XLogRecPtr	lsn;
+
+					/* Read the LSN while holding buffer header lock */
+					buf_state = LockBufHdr(buf);
+					lsn = BufferGetLSN(buf);
+					UnlockBufHdr(buf, buf_state);
+
+					if (XLogNeedsFlush(lsn) &&
+						StrategyRejectBuffer(strategy, buf))
+					{
+						/* Drop lock/pin and loop around for another buffer */
+						LWLockRelease(BufferDescriptorGetContentLock(buf));
+						UnpinBuffer(buf, true);
+						continue;
+					}
+				}
+
+				/* OK, do the I/O */
+				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
+														  smgr->smgr_rnode.node.spcNode,
+														  smgr->smgr_rnode.node.dbNode,
+														  smgr->smgr_rnode.node.relNode);
+
+				FlushBuffer(buf, NULL);
+				LWLockRelease(BufferDescriptorGetContentLock(buf));
+
+				ScheduleBufferTagForWriteback(&BackendWritebackContext,
+											  &buf->tag);
+
+				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
+														 smgr->smgr_rnode.node.spcNode,
+														 smgr->smgr_rnode.node.dbNode,
+														 smgr->smgr_rnode.node.relNode);
+			}
+			else
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf, true);
+				continue;
+			}
+		}
+
+		/*
+		 * To change the association of a valid buffer, we'll need to have
+		 * exclusive lock on both the old and new mapping partitions.
+		 */
+		if (oldFlags & BM_TAG_VALID)
+		{
+			/*
+			 * Need to compute the old tag's hashcode and partition lock ID.
+			 * XXX is it worth storing the hashcode in BufferDesc so we need
+			 * not recompute it here?  Probably not.
+			 */
+			oldTag = buf->tag;
+			oldHash = BufTableHashCode(&oldTag);
+			oldPartitionLock = BufMappingPartitionLock(oldHash);
+
+			if (*lastPartitionLock && 
+				(*lastPartitionLock != oldPartitionLock || 
+				 *lastPartitionLock != newPartitionLock))
+			{
+				LWLockRelease(*lastPartitionLock);
+				*lastPartitionLock = NULL;
+			}
+
+			/*
+			 * Must lock the lower-numbered partition first to avoid
+			 * deadlocks.
+			 */
+			if (oldPartitionLock < newPartitionLock)
+			{
+				Assert(*lastPartitionLock == NULL);
+				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
+				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+			}
+			else if (oldPartitionLock > newPartitionLock)
+			{
+				Assert(*lastPartitionLock == NULL);
+				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
+			}
+			else
+			{
+				/* only one partition, only one lock */
+				if (*lastPartitionLock != newPartitionLock)
+					LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+			}
+		}
+		else
+		{
+			/* if it wasn't valid, we need only the new partition */
+			if (*lastPartitionLock != newPartitionLock)
+			{
+				if (*lastPartitionLock)
+					LWLockRelease(*lastPartitionLock);
+				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+			}
+
+			/* remember we have no old-partition lock or tag */
+			oldPartitionLock = NULL;
+			/* keep the compiler quiet about uninitialized variables */
+			oldHash = 0;
+		}
+
+		*lastPartitionLock = newPartitionLock;
+
+		/*
+		 * Try to make a hashtable entry for the buffer under its new tag.
+		 * This could fail because while we were writing someone else
+		 * allocated another buffer for the same block we want to read in.
+		 * Note that we have not yet removed the hashtable entry for the old
+		 * tag.
+		 */
+		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+		Assert(buf_id < 0);
+
+		/*
+		 * Need to lock the buffer header too in order to change its tag.
+		 */
+		buf_state = LockBufHdr(buf);
+
+		/*
+		 * Somebody could have pinned or re-dirtied the buffer while we were
+		 * doing the I/O and making the new hashtable entry.  If so, we can't
+		 * recycle this buffer; we must undo everything we've done and start
+		 * over with a new victim buffer.
+		 */
+		oldFlags = buf_state & BUF_FLAG_MASK;
+		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
+			break;
+
+		pg_unreachable();
+	}
+
+	/*
+	 * Okay, it's finally safe to rename the buffer.
+	 *
+	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
+	 * paranoia.  We also reset the usage_count since any recency of use of
+	 * the old content is no longer relevant.  (The usage_count starts out at
+	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 *
+	 * Make sure BM_PERMANENT is set for buffers that must be written at every
+	 * checkpoint.  Unlogged buffers only need to be written at shutdown
+	 * checkpoints, except for their "init" forks, which need to be treated
+	 * just like permanent relations.
+	 */
+	buf->tag = newTag;
+	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
+				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
+				   BUF_USAGECOUNT_MASK);
+	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
+		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+	else
+		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+
+	UnlockBufHdr(buf, buf_state);
+
+	if (oldPartitionLock != NULL)
+	{
+		BufTableDelete(&oldTag, oldHash);
+		if (oldPartitionLock != newPartitionLock)
+			LWLockRelease(oldPartitionLock);
+	}
+
+	/*
+	 * Buffer contents are currently invalid.  Try to get the io_in_progress
+	 * lock.  If StartBufferIO returns false, then someone else managed to
+	 * read it before we did, so there's nothing left for BufferAlloc() to do.
+	 */
+	StartBufferIO(buf, true);
+
+	return buf;
+}
+
+
 /*
  * InvalidateBuffer -- mark a shared buffer invalid and return it to the
  * freelist.
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 3377fa5676..277705f18c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -124,7 +124,7 @@ typedef struct buftag
  * NB: NUM_BUFFER_PARTITIONS must be a power of 2!
  */
 #define BufTableHashPartition(hashcode) \
-	((hashcode) % NUM_BUFFER_PARTITIONS)
+	((hashcode >> 7) % NUM_BUFFER_PARTITIONS)
 #define BufMappingPartitionLock(hashcode) \
 	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
 		BufTableHashPartition(hashcode)].lock)
-- 
2.25.1
