On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pry...@telsasoft.com> wrote:
>
> Rebased to appease cfbot.
>
> I ran these patches under a branch which shows code coverage in cirrus.  It
> looks good to my eyes.
> https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html

Thanks for doing that and for the rebase! Since I made updates, the
attached version 6 is rebased as well.

To Dmitry's question:

On Sun, Feb 13, 2022 at 9:33 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:
>
> I don't see it in the discussion, so naturally curious -- why directmgr
> is not used for bloom index, e.g. in blbuildempty?

Thanks for pointing this out. blbuildempty() is now included. Bloom
doesn't seem to use smgr* anywhere except blbuildempty(), so that is the
only place I changed in the bloom index build.

v6 has the following updates/changes:

- removed erroneous extra calls to unbuffered_prep() and
  unbuffered_finish() in GiST and btree index builds

- removed unnecessary includes

- some comments were updated to accurately reflect use of directmgr

- unbuffered_write() doesn't WAL-log anymore. This one I'm not sure about. I
  think it makes sense for unbuffered_extend() of non-empty pages of
  WAL-logged relations (or the init fork of unlogged relations) to do
  log_newpage(), but I wasn't so sure about unbuffered_write().

  Currently, all callers of smgrwrite() do log_newpage(), and anyone using
  mdwrite() ends up writing the whole page. However, it seems possible that
  a caller of the directmgr API might wish to write to a particular offset
  and, either because of that or for some other reason, require different
  logging than log_newpage() provides.

  I didn't want to make a separate wrapper function for WAL-logging in
  directmgr because it felt like one more step to forget.
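
  To make that concrete, here is the caller pattern as it stands in v6,
  condensed from blbuildempty() in the attached first patch:
  unbuffered_write() checksums and writes the page, and the caller
  WAL-logs it explicitly afterwards.

      UnBufferedWriteState wstate;

      /* do_wal=true: the init fork must be WAL-logged and fsync'd */
      unbuffered_prep(&wstate, true);

      /* checksums and smgrwrite()s the page; no WAL logging here */
      unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
              BLOOM_METAPAGE_BLKNO, metapage);

      /* the caller decides how to WAL-log the write */
      log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
              BLOOM_METAPAGE_BLKNO, metapage, true);

      /* fsync, since the write bypassed shared buffers */
      unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);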

- heapam_relation_set_new_filenode doesn't use the directmgr API anymore for
  unlogged relations. It doesn't extend or write, so it seemed like a
  special case better left alone.

  Note that the ambuildempty() functions, which also write to the init fork
  of an unlogged relation, still use the directmgr API. It is a bit
  confusing that they pass do_wal=true to unbuffered_prep() even though the
  relations are unlogged, but the init fork must be WAL-logged and fsync'd.

- interface changes to unbuffered_prep()
  I removed the parameters to unbuffered_prep() which required the user to
  specify whether fsync should be added to pendingOps or done with
  smgrimmedsync(). Understanding all of the combinations of these
  parameters and when they were needed was confusing, and the interface
  felt like a footgun. Special cases shouldn't use this interface.

  I prefer the idea that users of this API expect that
  1) empty pages won't be checksummed or WAL-logged
  2) for relations that are WAL-logged, when the build is done, the
  relation will be fsync'd by the backend (unless the fsync optimization
  is being used)
  3) the only case in which fsync requests are added to the pendingOps
  queue is when the fsync optimization is being used (which saves the
  redo pointer and checks it later to determine whether the backend needs
  to fsync the relation itself)

  I also added the parameter "do_wal" to unbuffered_prep() and to the
  UnBufferedWriteState struct. It is used when extending the file to
  determine whether or not to call log_newpage(). unbuffered_extend() and
  unbuffered_write() no longer take do_wal as a parameter.

  Note that callers need to pass do_wal=true/false to unbuffered_prep()
  based on whether or not they want log_newpage() called during
  unbuffered_extend(), not simply based on whether or not the relation in
  question is WAL-logged.

  do_wal is the only member of the UnBufferedWriteState struct in the
  first patch in the set, but I think it makes sense to keep the struct
  around since I anticipate that the patch containing the other members
  needed for the fsync optimization will be committed at the same time.

  One final note on unbuffered_prep(): I am thinking of renaming
  "do_optimization" to "try_optimization" or maybe
  "request_fsync_optimization". The unbuffered_prep() interface would be
  better if we always tried the optimization when relevant (i.e., when the
  relation is WAL-logged).
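
  To make the intended usage concrete, here is a rough sketch of a
  WAL-logged build that writes outside shared buffers using the v6 API
  (rel, nblocks, and pages are placeholders; with the second patch
  applied, unbuffered_prep() also takes the do_optimization flag and
  saves the redo pointer):

      UnBufferedWriteState wstate;
      BlockNumber blkno;

      /* do_wal: WAL-log non-empty extends and fsync when finished */
      unbuffered_prep(&wstate, RelationNeedsWAL(rel));

      for (blkno = 0; blkno < nblocks; blkno++)
          unbuffered_extend(&wstate, RelationGetSmgr(rel), MAIN_FORKNUM,
                  blkno, pages[blkno], false);

      /*
       * For a WAL-logged relation this does smgrimmedsync(), unless the
       * fsync optimization applies: fsync requests were queued in
       * pendingOps, and if the redo pointer saved at prep time has not
       * moved by now, no checkpoint could have missed these writes, so
       * the immediate sync can be skipped.
       */
      unbuffered_finish(&wstate, RelationGetSmgr(rel), MAIN_FORKNUM);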

- freespace map, visimap, and hash index don't use directmgr API anymore
  For most use cases that write or extend outside shared buffers, it isn't
  safe to rely on requesting fsync from checkpointer.

  visimap, fsm, and hash index all request fsync from checkpointer for the
  pages they write with smgrextend() and don't call smgrimmedsync() when
  finished adding pages (even when the hash index is WAL-logged).

  Supporting these exceptions made the interface incoherent, so I cut
  them.
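
  For contrast, those callers lean on checkpointer for durability: they
  pass skipFsync=false so that smgr queues an fsync request in pendingOps,
  and they never call smgrimmedsync() themselves. Roughly (a condensed
  sketch, not lifted from the patch; rel, blkno, and pg are placeholders):

      PageSetChecksumInplace(pg, blkno);
      /* skipFsync=false: checkpointer will fsync this fork later */
      smgrextend(RelationGetSmgr(rel), FSM_FORKNUM, blkno, (char *) pg, false);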

- added unbuffered_extend_range()
  This one is a bit unfortunate. Because the GiST index build uses
  log_newpages() to log a whole page range but calls smgrextend() for each
  of those pages individually, I couldn't easily use the
  unbuffered_extend() function.

  So, I thought it might be useful in other contexts as well to have a
  function which calls smgrextend() for a range of pages and then calls
  log_newpages(). I've added this.

  However, there are two parts of the GiST index build's flush-ready-pages
  step that didn't work with this either.

  The first is that it does an error check on the block numbers one at a
  time while looping through them to write the pages. To retain this
  check, I loop through the ready pages an extra time before calling
  unbuffered_extend_range(), which is probably not acceptable.

  Also, GiST needs to use a custom LSN for the pages. To achieve this, I
  added a "custom_lsn" parameter to unbuffered_extend_range(). This isn't
  great either. If this were a more common case, I could pass the custom
  LSN to unbuffered_prep() instead.
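
  For reference, the GiST flush-ready-pages caller in the attached patch
  ends up as a single call, with the per-page LSN stamping, checksumming,
  and the log_newpages() call all moved into the helper:

      unbuffered_extend_range(&state->ub_wstate,
              RelationGetSmgr(state->indexrel), MAIN_FORKNUM,
              state->ready_num_pages, state->ready_blknos,
              state->ready_pages, false, GistBuildLSN);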

- Melanie
From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Fri, 4 Mar 2022 15:53:05 -0500
Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization

While building a btree index, the backend can avoid fsync'ing all of the
pages if it uses the optimization introduced in a prior commit.

This can substantially improve performance when many indexes are being
built during DDL operations.
---
 src/backend/access/nbtree/nbtree.c  | 2 +-
 src/backend/access/nbtree/nbtsort.c | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 6b78acefbe..fc5cce4603 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true, false);
+	unbuffered_prep(&wstate, true, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d6d0d4b361..f1b9e2e24e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
+	/*
+	 * The fsync optimization done by directmgr is only relevant if
+	 * WAL-logging, so pass btws_use_wal for this parameter.
+	 */
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
-- 
2.30.2

From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Fri, 4 Mar 2022 14:48:39 -0500
Subject: [PATCH v6 1/4] Add unbuffered IO API

Wrap unbuffered extends and writes in a new API, directmgr.

When writing data outside of shared buffers, the backend must do a
series of steps to ensure the data is both durable and recoverable.

When writing or extending a page of data outside of shared buffers, the
backend must log the write or extend (if the table is WAL-logged), checksum
the page (if it is not empty), and write it out before moving on.

Additionally, the backend must fsync the page data to ensure it reaches
permanent storage since checkpointer is unaware of the buffer and could
move the Redo pointer past the associated WAL for this write/extend
before the data is safely on permanent storage.

This commit introduces no functional change. It replaces many current
callers of smgrimmedsync(), smgrextend(), and smgrwrite() with the
equivalent directmgr functions. Consolidating these steps makes IO
outside of shared buffers less error-prone.
---
 contrib/bloom/blinsert.c               | 25 ++++---
 src/backend/access/gist/gistbuild.c    | 50 +++++--------
 src/backend/access/heap/rewriteheap.c  | 71 ++++++-------------
 src/backend/access/nbtree/nbtree.c     | 26 ++++---
 src/backend/access/nbtree/nbtsort.c    | 73 ++++++++-----------
 src/backend/access/spgist/spginsert.c  | 41 +++++------
 src/backend/catalog/storage.c          | 29 ++------
 src/backend/storage/Makefile           |  2 +-
 src/backend/storage/direct/Makefile    | 17 +++++
 src/backend/storage/direct/directmgr.c | 98 ++++++++++++++++++++++++++
 src/include/storage/directmgr.h        | 53 ++++++++++++++
 11 files changed, 296 insertions(+), 189 deletions(-)
 create mode 100644 src/backend/storage/direct/Makefile
 create mode 100644 src/backend/storage/direct/directmgr.c
 create mode 100644 src/include/storage/directmgr.h

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34e69..7954a17e2d 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -19,8 +19,8 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -164,6 +164,16 @@ void
 blbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	/*
+	 * Though this is an unlogged relation, pass do_wal=true since the init
+	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
+	 * has no effect, as unbuffered_write() does not do log_newpage()
+	 * internally. However, were this to be replaced with unbuffered_extend(),
+	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 */
+	unbuffered_prep(&wstate, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -176,18 +186,13 @@ blbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
-			  (char *) metapage, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			BLOOM_METAPAGE_BLKNO, metapage);
+
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				BLOOM_METAPAGE_BLKNO, metapage, true);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index e081e6571a..fc09938f80 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -91,6 +92,7 @@ typedef struct
 
 	int64		indtuples;		/* number of tuples indexed */
 
+	UnBufferedWriteState ub_wstate;
 	/*
 	 * Extra data structures used during a buffering build. 'gfbb' contains
 	 * information related to managing the build buffers. 'parentMap' is a
@@ -410,13 +412,15 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
 
+	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel));
+
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
 	 * replaced with the real root page at the end.
 	 */
 	page = palloc0(BLCKSZ);
-	smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			   page, true);
+	unbuffered_extend(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM, GIST_ROOT_BLKNO, page, true);
 	state->pages_allocated++;
 	state->pages_written++;
 
@@ -458,27 +462,19 @@ gist_indexsortbuild(GISTBuildState *state)
 
 	/* Write out the root */
 	PageSetLSN(levelstate->pages[0], GistBuildLSN);
-	PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
-			  levelstate->pages[0], true);
+
+	unbuffered_write(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM, GIST_ROOT_BLKNO, levelstate->pages[0]);
+
 	if (RelationNeedsWAL(state->indexrel))
 		log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
 					levelstate->pages[0], true);
 
+	unbuffered_finish(&state->ub_wstate, RelationGetSmgr(state->indexrel),
+			MAIN_FORKNUM);
+
 	pfree(levelstate->pages[0]);
 	pfree(levelstate);
-
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (RelationNeedsWAL(state->indexrel))
-		smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM);
 }
 
 /*
@@ -645,26 +641,18 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state)
 	if (state->ready_num_pages == 0)
 		return;
 
+	/* Currently, the blocks must be buffered in order. */
 	for (int i = 0; i < state->ready_num_pages; i++)
 	{
-		Page		page = state->ready_pages[i];
-		BlockNumber blkno = state->ready_blknos[i];
-
-		/* Currently, the blocks must be buffered in order. */
-		if (blkno != state->pages_written)
+		if (state->ready_blknos[i] != state->pages_written)
 			elog(ERROR, "unexpected block number to flush GiST sorting build");
-
-		PageSetLSN(page, GistBuildLSN);
-		PageSetChecksumInplace(page, blkno);
-		smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page,
-				   true);
-
 		state->pages_written++;
 	}
 
-	if (RelationNeedsWAL(state->indexrel))
-		log_newpages(&state->indexrel->rd_node, MAIN_FORKNUM, state->ready_num_pages,
-					 state->ready_blknos, state->ready_pages, true);
+	unbuffered_extend_range(&state->ub_wstate,
+			RelationGetSmgr(state->indexrel), MAIN_FORKNUM,
+			state->ready_num_pages, state->ready_blknos, state->ready_pages,
+			false, GistBuildLSN);
 
 	for (int i = 0; i < state->ready_num_pages; i++)
 		pfree(state->ready_pages[i]);
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..c8fa8bb27c 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -81,15 +81,15 @@
  * in the whole table.  Note that if we do fail halfway through a CLUSTER,
  * the old table is still valid, so failure is not catastrophic.
  *
- * We can't use the normal heap_insert function to insert into the new
- * heap, because heap_insert overwrites the visibility information.
- * We use a special-purpose raw_heap_insert function instead, which
- * is optimized for bulk inserting a lot of tuples, knowing that we have
- * exclusive access to the heap.  raw_heap_insert builds new pages in
- * local storage.  When a page is full, or at the end of the process,
- * we insert it to WAL as a single record and then write it to disk
- * directly through smgr.  Note, however, that any data sent to the new
- * heap's TOAST table will go through the normal bufmgr.
+ * We can't use the normal heap_insert function to insert into the new heap,
+ * because heap_insert overwrites the visibility information. We use a
+ * special-purpose raw_heap_insert function instead, which is optimized for
+ * bulk inserting a lot of tuples, knowing that we have exclusive access to the
+ * heap.  raw_heap_insert builds new pages in local storage.  When a page is
+ * full, or at the end of the process, we insert it to WAL as a single record
+ * and then write it to disk directly through directmgr.  Note, however, that
+ * any data sent to the new heap's TOAST table will go through the normal
+ * bufmgr.
  *
  *
  * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
@@ -119,9 +119,9 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "storage/bufmgr.h"
+#include "storage/directmgr.h"
 #include "storage/fd.h"
 #include "storage/procarray.h"
-#include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -152,6 +152,7 @@ typedef struct RewriteStateData
 	HTAB	   *rs_old_new_tid_map; /* unmatched B tuples */
 	HTAB	   *rs_logical_mappings;	/* logical remapping files */
 	uint32		rs_num_rewrite_mappings;	/* # in memory mappings */
+	UnBufferedWriteState rs_unbuffered_wstate;
 }			RewriteStateData;
 
 /*
@@ -265,6 +266,10 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cutoff_multi = cutoff_multi;
 	state->rs_cxt = rw_cxt;
 
+	unbuffered_prep(&state->rs_unbuffered_wstate,
+			RelationNeedsWAL(state->rs_new_rel));
+
+
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
 	hash_ctl.entrysize = sizeof(UnresolvedTupData);
@@ -317,28 +322,13 @@ end_heap_rewrite(RewriteState state)
 	/* Write the last page, if any */
 	if (state->rs_buffer_valid)
 	{
-		if (RelationNeedsWAL(state->rs_new_rel))
-			log_newpage(&state->rs_new_rel->rd_node,
-						MAIN_FORKNUM,
-						state->rs_blockno,
-						state->rs_buffer,
-						true);
-
-		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
-
-		smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-				   state->rs_blockno, (char *) state->rs_buffer, true);
+		unbuffered_extend(&state->rs_unbuffered_wstate,
+				RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+				state->rs_blockno, state->rs_buffer, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is the same as in storage.c's RelationCopyStorage(): we're
-	 * writing data that's not in shared buffers, and so a CHECKPOINT
-	 * occurring during the rewriteheap operation won't have fsync'd data we
-	 * wrote before the checkpoint.
-	 */
-	if (RelationNeedsWAL(state->rs_new_rel))
-		smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
+	unbuffered_finish(&state->rs_unbuffered_wstate,
+			RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
 
 	logical_end_heap_rewrite(state);
 
@@ -676,24 +666,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			 * contains a tuple.  Hence, unlike RelationGetBufferForTuple(),
 			 * enforce saveFreeSpace unconditionally.
 			 */
-
-			/* XLOG stuff */
-			if (RelationNeedsWAL(state->rs_new_rel))
-				log_newpage(&state->rs_new_rel->rd_node,
-							MAIN_FORKNUM,
-							state->rs_blockno,
-							page,
-							true);
-
-			/*
-			 * Now write the page. We say skipFsync = true because there's no
-			 * need for smgr to schedule an fsync for this write; we'll do it
-			 * ourselves in end_heap_rewrite.
-			 */
-			PageSetChecksumInplace(page, state->rs_blockno);
-
-			smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
-					   state->rs_blockno, (char *) page, true);
+			unbuffered_extend(&state->rs_unbuffered_wstate,
+					RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,
+					state->rs_blockno, page, false);
 
 			state->rs_blockno++;
 			state->rs_buffer_valid = false;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964c1e..c7bf971917 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -30,10 +30,10 @@
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
+#include "storage/directmgr.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
-#include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -152,6 +152,16 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	UnBufferedWriteState wstate;
+
+	/*
+	 * Though this is an unlogged relation, pass do_wal=true since the init
+	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
+	 * has no effect, as unbuffered_write() does not do log_newpage()
+	 * internally. However, were this to be replaced with unbuffered_extend(),
+	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 */
+	unbuffered_prep(&wstate, true);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
@@ -164,18 +174,12 @@ btbuildempty(Relation index)
 	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
 	 * this even when wal_level=minimal.
 	 */
-	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
-			  (char *) metapage, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			BTREE_METAPAGE, metapage);
 	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
 				BTREE_METAPAGE, metapage, true);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the page, because the
-	 * write did not go through shared_buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
@@ -959,7 +963,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * delete some deletable tuples.  Hence, we must repeatedly check the
 	 * relation length.  We must acquire the relation-extension lock while
 	 * doing so to avoid a race condition: if someone else is extending the
-	 * relation, there is a window where bufmgr/smgr have created a new
+	 * relation, there is a window where bufmgr/directmgr have created a new
 	 * all-zero page but it hasn't yet been write-locked by _bt_getbuf(). If
 	 * we manage to scan such a page here, we'll improperly assume it can be
 	 * recycled.  Taking the lock synchronizes things enough to prevent a
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8a19de2f66..e280253127 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,13 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but
- * that is of no value (since other backends have no interest in them yet)
- * and it created locking problems for CHECKPOINT, because the upper-level
- * pages were held exclusive-locked for long periods.  Now we just build
- * the pages in local memory and smgrwrite or smgrextend them as we finish
- * them.  They will need to be re-read into shared buffers on first use after
- * the build finishes.
+ * Formerly the index pages being built were kept in shared buffers, but that
+ * is of no value (since other backends have no interest in them yet) and it
+ * created locking problems for CHECKPOINT, because the upper-level pages were
+ * held exclusive-locked for long periods.  Now we just build the pages in
+ * local memory and write or extend them with directmgr as we finish them.
+ * They will need to be re-read into shared buffers on first use after the
+ * build finishes.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -57,7 +57,7 @@
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
-#include "storage/smgr.h"
+#include "storage/directmgr.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
@@ -256,6 +256,7 @@ typedef struct BTWriteState
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
+	UnBufferedWriteState ub_wstate;
 } BTWriteState;
 
 
@@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level)
 static void
 _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 {
-	/* XLOG stuff */
-	if (wstate->btws_use_wal)
-	{
-		/* We use the XLOG_FPI record type for this */
-		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-	}
-
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
 	 * zeroes until we come back and overwrite.  This is not logically
@@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	{
 		if (!wstate->btws_zeropage)
 			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
-		/* don't set checksum for all-zero page */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
-				   wstate->btws_pages_written++,
-				   (char *) wstate->btws_zeropage,
-				   true);
+
+		unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+				MAIN_FORKNUM, wstate->btws_pages_written++,
+				wstate->btws_zeropage, true);
 	}
 
-	PageSetChecksumInplace(page, blkno);
 
-	/*
-	 * Now write the page.  There's no need for smgr to schedule an fsync for
-	 * this write; we'll do it ourselves before ending the build.
-	 */
+	/* Now write the page. Either we are extending the file... */
 	if (blkno == wstate->btws_pages_written)
 	{
-		/* extending the file... */
-		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				   (char *) page, true);
+		unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+				MAIN_FORKNUM, blkno, page, false);
+
 		wstate->btws_pages_written++;
 	}
+
+	/* or we are overwriting a block we zero-filled before. */
 	else
 	{
-		/* overwriting a block we zero-filled before */
-		smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno,
-				  (char *) page, true);
-	}
+		unbuffered_write(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+				MAIN_FORKNUM, blkno, page);
+
+		/* We use the XLOG_FPI record type for this */
+		if (wstate->btws_use_wal)
+			log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
 
+	}
 	pfree(page);
 }
 
@@ -1195,6 +1189,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal);
+
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
@@ -1421,17 +1417,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
-	/*
-	 * When we WAL-logged index pages, we must nonetheless fsync index files.
-	 * Since we're building outside shared buffers, a CHECKPOINT occurring
-	 * during the build has no way to flush the previously written data to
-	 * disk (indeed it won't know the index even exists).  A crash later on
-	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
-	 * earlier WAL entries. If we do not fsync those pages here, they might
-	 * still not be on disk when the crash occurs.
-	 */
-	if (wstate->btws_use_wal)
-		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
+	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
+			MAIN_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bfb74049d0..318dbee823 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -25,7 +25,7 @@
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
-#include "storage/smgr.h"
+#include "storage/directmgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -156,48 +156,43 @@ void
 spgbuildempty(Relation index)
 {
 	Page		page;
+	UnBufferedWriteState wstate;
+
+	/*
+	 * Though this is an unlogged relation, pass do_wal=true since the init
+	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
+	 * has no effect, as unbuffered_write() does not do log_newpage()
+	 * internally. However, were this to be replaced with unbuffered_extend(),
+	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 */
+	unbuffered_prep(&wstate, true);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
 	SpGistInitMetapage(page);
 
-	/*
-	 * Write the page and log it unconditionally.  This is important
-	 * particularly for indexes created on tablespaces and databases whose
-	 * creation happened after the last redo pointer as recovery removes any
-	 * of their existing content when the corresponding create records are
-	 * replayed.
-	 */
-	PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_METAPAGE_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_METAPAGE_BLKNO, page, true);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
 
-	PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_ROOT_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_ROOT_BLKNO, page, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
 
-	PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
-	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
-			  (char *) page, true);
+	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
+			SPGIST_NULL_BLKNO, page);
 	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 				SPGIST_NULL_BLKNO, page, true);
 
-	/*
-	 * An immediate sync is required even if we xlog'd the pages, because the
-	 * writes did not go through shared buffers and therefore a concurrent
-	 * checkpoint may have moved the redo pointer past our xlog record.
-	 */
-	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..0b211895c1 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "miscadmin.h"
+#include "storage/directmgr.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
 #include "utils/hsearch.h"
@@ -420,6 +421,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	bool		copying_initfork;
 	BlockNumber nblocks;
 	BlockNumber blkno;
+	UnBufferedWriteState wstate;
+
 
 	page = (Page) buf.data;
 
@@ -440,6 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
+	unbuffered_prep(&wstate, use_wal);
+
 	nblocks = smgrnblocks(src, forkNum);
 
 	for (blkno = 0; blkno < nblocks; blkno++)
@@ -474,30 +479,10 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 		 * page this is, so we have to log the full page including any unused
 		 * space.
 		 */
-		if (use_wal)
-			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
-
-		PageSetChecksumInplace(page, blkno);
-
-		/*
-		 * Now write the page.  We say skipFsync = true because there's no
-		 * need for smgr to schedule an fsync for this write; we'll do it
-		 * ourselves below.
-		 */
-		smgrextend(dst, forkNum, blkno, buf.data, true);
+		unbuffered_extend(&wstate, dst, forkNum, blkno, page, false);
 	}
 
-	/*
-	 * When we WAL-logged rel pages, we must nonetheless fsync them.  The
-	 * reason is that since we're copying outside shared buffers, a CHECKPOINT
-	 * occurring during the copy has no way to flush the previously written
-	 * data to disk (indeed it won't know the new rel even exists).  A crash
-	 * later on would replay WAL from the checkpoint, therefore it wouldn't
-	 * replay our earlier WAL entries. If we do not fsync those pages here,
-	 * they might still not be on disk when the crash occurs.
-	 */
-	if (use_wal || copying_initfork)
-		smgrimmedsync(dst, forkNum);
+	unbuffered_finish(&wstate, dst, forkNum);
 }
 
 /*
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..501fae5f9d 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = buffer direct file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile
new file mode 100644
index 0000000000..d82bbed48c
--- /dev/null
+++ b/src/backend/storage/direct/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for storage/direct
+#
+# IDENTIFICATION
+#    src/backend/storage/direct/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/direct
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = directmgr.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
new file mode 100644
index 0000000000..42c37daa7a
--- /dev/null
+++ b/src/backend/storage/direct/directmgr.c
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.c
+ *	  routines for managing unbuffered IO
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/direct/directmgr.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+
+#include "access/xlogdefs.h"
+#include "access/xloginsert.h"
+#include "storage/directmgr.h"
+
+void
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
+{
+	wstate->do_wal = do_wal;
+}
+
+void
+unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation
+		smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
+		empty)
+{
+	/*
+	 * Don't checksum empty pages
+	 */
+	if (!empty)
+		PageSetChecksumInplace(page, blocknum);
+
+	smgrextend(smgrrel, forknum, blocknum, (char *) page, true);
+
+	/*
+	 * Don't WAL-log empty pages
+	 */
+	if (!empty && wstate->do_wal)
+		log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
+					blocknum, page, true);
+}
+
+void
+unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
+		bool empty, XLogRecPtr custom_lsn)
+{
+	for (int i = 0; i < num_pages; i++)
+	{
+		Page		page = pages[i];
+		BlockNumber blkno = blocknums[i];
+
+		if (!XLogRecPtrIsInvalid(custom_lsn))
+			PageSetLSN(page, custom_lsn);
+
+		if (!empty)
+			PageSetChecksumInplace(page, blkno);
+
+		smgrextend(smgrrel, forknum, blkno, (char *) page, true);
+	}
+
+	if (!empty && wstate->do_wal)
+		log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages,
+				blocknums, pages, true);
+}
+
+void
+unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
+		forknum, BlockNumber blocknum, Page page)
+{
+	PageSetChecksumInplace(page, blocknum);
+
+	smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
+}
+
+/*
+ * When writing data outside shared buffers, a concurrent CHECKPOINT can move
+ * the redo pointer past our WAL entries and won't flush our data to disk. If
+ * the database crashes before the data makes it to disk, our WAL won't be
+ * replayed and the data will be lost.
+ * Thus, if a CHECKPOINT begins between unbuffered_prep() and
+ * unbuffered_finish(), the backend must fsync the data itself.
+ */
+void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum)
+{
+	if (!wstate->do_wal)
+		return;
+
+	smgrimmedsync(smgrrel, forknum);
+}
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
new file mode 100644
index 0000000000..db5e3b1cac
--- /dev/null
+++ b/src/include/storage/directmgr.h
@@ -0,0 +1,53 @@
+/*-------------------------------------------------------------------------
+ *
+ * directmgr.h
+ *	  POSTGRES unbuffered IO manager definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/directmgr.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DIRECTMGR_H
+#define DIRECTMGR_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/smgr.h"
+
+typedef struct UnBufferedWriteState
+{
+	/*
+	 * When writing WAL-logged relation data outside of shared buffers, there
+	 * is a risk of a concurrent CHECKPOINT moving the redo pointer past the
+	 * data's associated WAL entries. To avoid this, callers in this situation
+	 * must fsync the pages they have written themselves. This is necessary
+	 * only if the relation is WAL-logged or in special cases such as the init
+	 * fork of an unlogged index.
+	 */
+	bool do_wal;
+} UnBufferedWriteState;
+/*
+ * prototypes for functions in directmgr.c
+ */
+extern void
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);
+extern void
+unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum, BlockNumber blocknum, Page page, bool empty);
+extern void
+unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
+		bool empty, XLogRecPtr custom_lsn);
+extern void
+unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
+		forknum, BlockNumber blocknum, Page page);
+extern void
+unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
+		ForkNumber forknum);
+
+#endif							/* DIRECTMGR_H */
-- 
2.30.2

From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Fri, 4 Mar 2022 15:52:45 -0500
Subject: [PATCH v6 4/4] Use shared buffers when possible for index build

When there are not too many tuples, building the index in shared buffers
makes sense. It allows the buffer manager to handle how best to do the
IO.
---
 src/backend/access/nbtree/nbtree.c  |  32 ++--
 src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++-------
 2 files changed, 223 insertions(+), 82 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index fc5cce4603..d3982b9388 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -152,34 +152,24 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
-	UnBufferedWriteState wstate;
+	Buffer metabuf;
 
 	/*
-	 * Though this is an unlogged relation, pass do_wal=true since the init
-	 * fork of an unlogged index must be wal-logged and fsync'd. This currently
-	 * has no effect, as unbuffered_write() does not do log_newpage()
-	 * internally. However, were this to be replaced with unbuffered_extend(),
-	 * do_wal must be true to ensure the data is logged and fsync'd.
+	 * Allocate a buffer for metapage and initialize metapage.
 	 */
-	unbuffered_prep(&wstate, true, true);
-
-	/* Construct metapage. */
-	metapage = (Page) palloc(BLCKSZ);
+	metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK,
+			NULL);
+	metapage = BufferGetPage(metabuf);
 	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
-	 * Write the page and log it.  It might seem that an immediate sync would
-	 * be sufficient to guarantee that the file exists on disk, but recovery
-	 * itself might remove it while replaying, for example, an
-	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-	 * this even when wal_level=minimal.
+	 * Mark metapage buffer as dirty and XLOG it
 	 */
-	unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
-			BTREE_METAPAGE, metapage);
-	log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
-				BTREE_METAPAGE, metapage, true);
-
-	unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
+	START_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
+	log_newpage_buffer(metabuf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f1b9e2e24e..35ded6e4a1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -23,13 +23,15 @@
  * many upper pages if the keys are reasonable-size) without risking a lot of
  * cascading splits during early insertions.
  *
- * Formerly the index pages being built were kept in shared buffers, but that
- * is of no value (since other backends have no interest in them yet) and it
- * created locking problems for CHECKPOINT, because the upper-level pages were
- * held exclusive-locked for long periods.  Now we just build the pages in
- * local memory and write or extend them with directmgr as we finish them.
- * They will need to be re-read into shared buffers on first use after the
- * build finishes.
+ * If indexing little enough data, it can be built efficiently in shared
+ * buffers, leaving checkpointer to deal with fsync'ing the data at its
+ * convenience. However, if indexing many pages, building the index in shared
+ * buffers could delay CHECKPOINTs--since the upper-level pages are
+ * exclusive-locked for long periods. In that case, build the pages in local
+ * memory and write or extend them with directmgr as they are finished. If a
+ * CHECKPOINT has begun since the build started, directmgr will fsync the
+ * relation file itself after finishing the build. The index will then need to
+ * be re-read into shared buffers on first use after the build finishes.
  *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
@@ -236,6 +238,7 @@ typedef struct BTPageState
 {
 	Page		btps_page;		/* workspace for page building */
 	BlockNumber btps_blkno;		/* block # to write this page at */
+	Buffer btps_buf; /* buffer to write this page to */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
 	Size		btps_lastextra; /* last item's extra posting list space */
@@ -253,10 +256,26 @@ typedef struct BTWriteState
 	Relation	index;
 	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
-	BlockNumber btws_pages_alloced; /* # pages allocated */
-	BlockNumber btws_pages_written; /* # pages written out */
+	BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */
+	BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */
 	Page		btws_zeropage;	/* workspace for filling zeroes */
 	UnBufferedWriteState ub_wstate;
+	/*
+	 * Allocate a new btree page. This does not initialize the page.
+	 */
+	Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber
+			*blockno, Buffer *buf);
+	/*
+	 * Emit a completed btree page, and release the working storage.
+	 */
+	void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page,
+			BlockNumber blkno, Buffer buf);
+
+	void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate, bool do_wal,
+			bool do_optimization);
+
+	void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate,
+			SMgrRelation smgrrel, ForkNumber forknum);
 } BTWriteState;
 
 
@@ -265,10 +284,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf);
+
+static Page _bt_blnewpage(uint32 level, Page page);
+
+static void
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf);
+
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -279,9 +310,10 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer
+		metabuf, Page metapage);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
 							   int request);
 static void _bt_end_parallel(BTLeader *btleader);
@@ -294,6 +326,21 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 									   Sharedsort *sharedsort2, int sortmem,
 									   bool progress);
 
+#define BT_BUILD_SB_THRESHOLD 1024
+
+static const BTWriteState wstate_shared = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_shared,
+	._bt_blwritepage = _bt_blwritepage_shared,
+	._bt_bl_unbuffered_prep = NULL,
+	._bt_bl_unbuffered_finish = NULL,
+};
+
+static const BTWriteState wstate_direct = {
+	._bt_bl_alloc_page = _bt_bl_alloc_page_direct,
+	._bt_blwritepage = _bt_blwritepage_direct,
+	._bt_bl_unbuffered_prep = unbuffered_prep,
+	._bt_bl_unbuffered_finish = unbuffered_finish,
+};
 
 /*
  *	btbuild() -- build a new btree index.
@@ -303,6 +350,7 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	BTBuildState buildstate;
+	BTWriteState writestate;
 	double		reltuples;
 
 #ifdef BTREE_BUILD_STATS
@@ -333,8 +381,12 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
+	 *
+	 * Don't use shared buffers if the number of tuples is too large.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	writestate = reltuples < BT_BUILD_SB_THRESHOLD ? wstate_shared : wstate_direct;
+
+	_bt_leafbuild(&writestate, buildstate.spool, buildstate.spool2);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
@@ -542,10 +594,8 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
-	BTWriteState wstate;
-
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
 	{
@@ -565,21 +615,38 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 		tuplesort_performsort(btspool2->sortstate);
 	}
 
-	wstate.heap = btspool->heap;
-	wstate.index = btspool->index;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-	/* _bt_mkscankey() won't set allequalimage without metapage */
-	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
+	wstate->heap = btspool->heap;
+	wstate->index = btspool->index;
+	wstate->inskey = _bt_mkscankey(wstate->index, NULL);
+	/* _bt_mkscankey() won't set allequalimage without metapage */
+	wstate->inskey->allequalimage = _bt_allequalimage(wstate->index, true);
+	wstate->btws_use_wal = RelationNeedsWAL(wstate->index);
 
 	/* reserve the metapage */
-	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
-	wstate.btws_pages_written = 0;
-	wstate.btws_zeropage = NULL;	/* until needed */
+	wstate->btws_pages_alloced = 0;
+	wstate->btws_pages_written = 0;
+	wstate->btws_zeropage = NULL;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	_bt_load(&wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, save the redo
+	 * pointer location in case a checkpoint begins during the index build.
+	 */
+	if (wstate->_bt_bl_unbuffered_prep)
+		wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
+				wstate->btws_use_wal, wstate->btws_use_wal);
+
+	_bt_load(wstate, btspool, btspool2);
+
+	/*
+	 * If not using shared buffers, for a WAL-logged relation, check if backend
+	 * must fsync the page itself.
+	 */
+	if (wstate->_bt_bl_unbuffered_finish)
+		wstate->_bt_bl_unbuffered_finish(&wstate->ub_wstate,
+				RelationGetSmgr(wstate->index), MAIN_FORKNUM);
 }
 
 /*
@@ -612,15 +679,15 @@ _bt_build_callback(Relation index,
 }
 
 /*
- * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * Set up workspace for a new, clean btree page, not linked to any siblings.
+ * Caller must allocate the passed in page.
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Page page)
 {
-	Page		page;
 	BTPageOpaque opaque;
 
-	page = (Page) palloc(BLCKSZ);
+	Assert(page);
 
 	/* Zero the page and set up standard page header info */
 	_bt_pageinit(page, BLCKSZ);
@@ -638,11 +705,8 @@ _bt_blnewpage(uint32 level)
 	return page;
 }
 
-/*
- * emit a completed btree page, and release the working storage.
- */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
 	/*
 	 * If we have to write pages nonsequentially, fill in the space with
@@ -685,6 +749,61 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	pfree(page);
 }
 
+static void
+_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
+{
+	/*
+	 * Indexes built in shared buffers need only to mark the buffer as dirty
+	 * and XLOG it.
+	 */
+	Assert(buf);
+	START_CRIT_SECTION();
+	MarkBufferDirty(buf);
+	if (wstate->btws_use_wal)
+		log_newpage_buffer(buf, true);
+	END_CRIT_SECTION();
+	_bt_relbuf(wstate->index, buf);
+}
+
+static Page
+_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	 /* buf is only used when using shared buffers, so set it to InvalidBuffer */
+	*buf = InvalidBuffer;
+
+	/*
+	 * Assign block number for the page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = wstate->btws_pages_alloced++;
+
+	/* now allocate and set up the new page */
+	return palloc(BLCKSZ);
+}
+
+static Page
+_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf)
+{
+	/*
+	 * Find a shared buffer for the page. Pass mode RBM_ZERO_AND_LOCK to get an
+	 * exclusive lock on the buffer content. No lock on the relation as a whole
+	 * is needed (as in LockRelationForExtension()) because the initial index
+	 * build is not yet complete.
+	 */
+	*buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
+			RBM_ZERO_AND_LOCK, NULL);
+
+	/*
+	 * bufmgr will assign a block number for the new page.
+	 * This will be used to link to sibling page(s) later and, if this is the
+	 * initial page in the level, saved in the BTPageState
+	 */
+	*blockno = BufferGetBlockNumber(*buf);
+
+	return BufferGetPage(*buf);
+}
+
 /*
  * allocate and initialize a new BTPageState.  the returned structure
  * is suitable for immediate use by _bt_buildadd.
@@ -692,13 +811,20 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+	Buffer buf;
+	BlockNumber blockno;
+
 	BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-	/* create initial page for level */
-	state->btps_page = _bt_blnewpage(level);
+	/*
+	 * Allocate and initialize initial page for the level, and if using shared
+	 * buffers, extend the relation and allocate a shared buffer for the block.
+	 */
+	state->btps_page = _bt_blnewpage(level, wstate->_bt_bl_alloc_page(wstate,
+				&blockno, &buf));
 
-	/* and assign it a page position */
-	state->btps_blkno = wstate->btws_pages_alloced++;
+	state->btps_blkno = blockno;
+	state->btps_buf = buf;
 
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -833,6 +959,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
 	Page		npage;
 	BlockNumber nblkno;
+	Buffer nbuf;
 	OffsetNumber last_off;
 	Size		last_truncextra;
 	Size		pgspc;
@@ -847,6 +974,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
+	nbuf = state->btps_buf;
 	last_off = state->btps_lastoff;
 	last_truncextra = state->btps_lastextra;
 	state->btps_lastextra = truncextra;
@@ -903,15 +1031,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 		 */
 		Page		opage = npage;
 		BlockNumber oblkno = nblkno;
+		Buffer obuf = nbuf;
 		ItemId		ii;
 		ItemId		hii;
 		IndexTuple	oitup;
 
-		/* Create new page of same level */
-		npage = _bt_blnewpage(state->btps_level);
-
-		/* and assign it a page position */
-		nblkno = wstate->btws_pages_alloced++;
+		/* Create and initialize a new page of same level */
+		npage = _bt_blnewpage(state->btps_level,
+				wstate->_bt_bl_alloc_page(wstate, &nblkno, &nbuf));
 
 		/*
 		 * We copy the last item on the page into the new page, and then
@@ -1021,9 +1148,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 		/*
 		 * Write out the old page.  We never need to touch it again, so we can
-		 * free the opage workspace too.
+		 * free the opage workspace too. obuf has been released and is no longer
+		 * valid.
 		 */
-		_bt_blwritepage(wstate, opage, oblkno);
+		 wstate->_bt_blwritepage(wstate, opage, oblkno, obuf);
+		 obuf = InvalidBuffer;
+		 opage = NULL;
 
 		/*
 		 * Reset last_off to point to new page
@@ -1058,6 +1188,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
 	state->btps_page = npage;
 	state->btps_blkno = nblkno;
+	state->btps_buf = nbuf;
 	state->btps_lastoff = last_off;
 }
 
@@ -1103,12 +1234,12 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf,
+		Page metapage)
 {
 	BTPageState *s;
 	BlockNumber rootblkno = P_NONE;
 	uint32		rootlevel = 0;
-	Page		metapage;
 
 	/*
 	 * Each iteration of this loop completes one more level of the tree.
@@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		 * back one slot.  Then we can dump out the page.
 		 */
 		_bt_slideleft(s->btps_page);
-		_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+		wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+		s->btps_buf = InvalidBuffer;
 		s->btps_page = NULL;	/* writepage freed the workspace */
 	}
 
 	/*
-	 * As the last step in the process, construct the metapage and make it
+	 * As the last step in the process, initialize the metapage and make it
 	 * point to the new root (unless we had no data at all, in which case it's
 	 * set to point to "P_NONE").  This changes the index to the "valid" state
 	 * by filling in a valid magic number in the metapage.
+	 * After this call, metapage will have been freed (or is otherwise
+	 * invalid) and metabuf, if it was ever valid, will have been released.
 	 */
-	metapage = (Page) palloc(BLCKSZ);
 	_bt_initmetapage(metapage, rootblkno, rootlevel,
 					 wstate->inskey->allequalimage);
-	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+	wstate->_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
+	metabuf = InvalidBuffer;
+	metapage = NULL;
 }
 
 /*
@@ -1177,6 +1312,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 static void
 _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 {
+	Page metapage;
+	BlockNumber metablkno;
+	Buffer metabuf;
+
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
@@ -1189,15 +1328,30 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-	/*
-	 * The fsync optimization done by directmgr is only relevant if
-	 * WAL-logging, so pass btws_use_wal for this parameter.
-	 */
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
-
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
 
+	/*
+	 * Reserve block 0 for the metapage up front.
+	 *
+	 * When using the shared buffers API, it is easier to allocate the buffer
+	 * for block 0 first than to skip block 0 and allocate it at the end of
+	 * the index build.
+	 *
+	 * When not using the shared buffers API, there is no harm in allocating
+	 * the metapage first. When block 1 is written, the direct writepage
+	 * function will zero-fill block 0, and when the metapage is written out
+	 * at the end of the index build, it will overwrite that block 0.
+	 *
+	 * The metapage will be initialized and written out at the end of the
+	 * index build, once all of the information needed to do so is available.
+	 *
+	 * The block number will always be BTREE_METAPAGE, so the metablkno
+	 * variable is unused and exists only to avoid a special case in the
+	 * direct alloc function.
+	 */
+	metapage = wstate->_bt_bl_alloc_page(wstate, &metablkno, &metabuf);
+
 	if (merge)
 	{
 		/*
@@ -1419,10 +1573,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 
 	/* Close down final pages and write the metapage */
-	_bt_uppershutdown(wstate, state);
-
-	unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
-			MAIN_FORKNUM);
+	_bt_uppershutdown(wstate, state, metabuf, metapage);
 }
 
 /*
-- 
2.30.2
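
To make the new nbtsort.c structure easier to follow, here is a rough
sketch of the two page-allocation callbacks that the hunks above invoke
through wstate->_bt_bl_alloc_page. The function names are placeholders
and the bodies are inferred (the shared-buffers path from the
RBM_ZERO_AND_LOCK call at the start of this patch's nbtsort.c changes,
the direct path from the btws_pages_alloced lines the patch removes), so
treat this as an illustration rather than the patch's exact code:

    /* Shared-buffers path: let the buffer manager pick the block number. */
    static Page
    _bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno,
                             Buffer *buf)
    {
        *buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
                                  RBM_ZERO_AND_LOCK, NULL);
        *blockno = BufferGetBlockNumber(*buf);
        return BufferGetPage(*buf);
    }

    /* Direct path: no shared buffer; just hand out the next block number. */
    static Page
    _bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno,
                             Buffer *buf)
    {
        *buf = InvalidBuffer;
        *blockno = wstate->btws_pages_alloced++;
        return (Page) palloc(BLCKSZ);
    }

Either way, _bt_blnewpage() initializes the returned page and the
BTPageState records the block number and buffer, so _bt_buildadd() and
_bt_uppershutdown() do not need to care which path is in use.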

From ac914bf0e227b3cb62918954a7d7567a3f09f3b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Fri, 4 Mar 2022 14:14:21 -0500
Subject: [PATCH v6 2/4] Avoid immediate fsync for unbuffered IO

Data written to WAL-logged relations is durable once the WAL entries are
on permanent storage; however, the XLOG Redo pointer cannot be moved
past the associated WAL until the page data is safely on permanent
storage. If the Redo pointer were moved past that WAL before the data
had been fsync'd and a crash then occurred, the WAL would not be
replayed during recovery, and the data would be lost.

This is not a problem with pages written in shared buffers because the
checkpointer will block until FlushBuffer() is complete for all buffers
that were dirtied before it began. Therefore it will not move the Redo
pointer past their associated WAL entries until it has fsync'd the data.

A backend writing data outside of shared buffers must ensure that the
data has reached permanent storage itself or that the Redo pointer has
not moved while it was writing the data.

In the common case, the backend should not have to do this fsync itself
and can instead request that the checkpointer do it.

To ensure this is safe, the backend can save the XLOG Redo pointer
location before doing the write or extend. Then it can add an fsync
request for the page to the checkpointer's pending-ops table using the
existing mechanism. After doing the write or extend, if the Redo pointer
has moved (meaning a checkpoint has started since the pointer was
saved), the backend can simply fsync the page itself. Otherwise, the
checkpointer takes care of fsync'ing the page the next time it processes
the pending-ops table.

This commit adds the optimization option to the directmgr API but does
not add any users, so there is no behavior change.
---
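Note for reviewers: no caller uses the optimization yet, but the intended
calling pattern looks roughly like the sketch below. The smgrrel, nblocks,
and pages variables are placeholders, and the sketch assumes that
unbuffered_extend() registers the sync request with the checkpointer
(rather than fsync'ing immediately) when the optimization is in use, as
described above:

    UnBufferedWriteState wstate;
    BlockNumber blkno;

    /* Save the Redo pointer and opt in to the pending-ops queue. */
    unbuffered_prep(&wstate, /* do_wal */ true, /* do_optimization */ true);

    for (blkno = 0; blkno < nblocks; blkno++)
        unbuffered_extend(&wstate, smgrrel, MAIN_FORKNUM, blkno,
                          pages[blkno], false);

    /*
     * Skips the fsync if the Redo pointer has not moved; otherwise it
     * falls back to smgrimmedsync().
     */
    unbuffered_finish(&wstate, smgrrel, MAIN_FORKNUM);
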
 contrib/bloom/blinsert.c               |  2 +-
 src/backend/access/gist/gistbuild.c    |  2 +-
 src/backend/access/heap/rewriteheap.c  |  3 +--
 src/backend/access/nbtree/nbtree.c     |  2 +-
 src/backend/access/nbtree/nbtsort.c    |  2 +-
 src/backend/access/spgist/spginsert.c  |  2 +-
 src/backend/access/transam/xlog.c      | 13 +++++++++++++
 src/backend/catalog/storage.c          |  2 +-
 src/backend/storage/direct/directmgr.c | 14 +++++++++++++-
 src/include/access/xlog.h              |  1 +
 src/include/storage/directmgr.h        | 11 ++++++++++-
 11 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 7954a17e2d..fdde2bd88a 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fc09938f80..8de19199a6 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -412,7 +412,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
 
-	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel));
+	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel), false);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c8fa8bb27c..70b7a0f269 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -267,8 +267,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cxt = rw_cxt;
 
 	unbuffered_prep(&state->rs_unbuffered_wstate,
-			RelationNeedsWAL(state->rs_new_rel));
-
+			RelationNeedsWAL(state->rs_new_rel), false);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c7bf971917..6b78acefbe 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e280253127..d6d0d4b361 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1189,7 +1189,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal);
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 318dbee823..e30f3623f5 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -165,7 +165,7 @@ spgbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..db7b33ec67 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5971,6 +5971,19 @@ GetLastImportantRecPtr(void)
 	return res;
 }
 
+/* Has the Redo pointer moved since comparator_ptr was saved? */
+bool
+RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr	ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Get the time and LSN of the last xlog segment switch
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0b211895c1..2fd5a31ffd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
-	unbuffered_prep(&wstate, use_wal);
+	unbuffered_prep(&wstate, use_wal, false);
 
 	nblocks = smgrnblocks(src, forkNum);
 
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
index 42c37daa7a..4c3a5a2f74 100644
--- a/src/backend/storage/direct/directmgr.c
+++ b/src/backend/storage/direct/directmgr.c
@@ -15,14 +15,23 @@
 #include "postgres.h"
 
 
+#include "access/xlog.h"
 #include "access/xlogdefs.h"
 #include "access/xloginsert.h"
 #include "storage/directmgr.h"
 
 void
-unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal,
+		bool do_optimization)
 {
+	/*
+	 * The fsync optimization only makes sense if WAL is being written.
+	 */
+	Assert(!do_optimization || do_wal);
+
 	wstate->do_wal = do_wal;
+	wstate->do_optimization = do_optimization;
+	wstate->redo = do_optimization ? GetRedoRecPtr() : InvalidXLogRecPtr;
 }
 
 void
@@ -94,5 +103,8 @@ unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 	if (!wstate->do_wal)
 		return;
 
+	if (wstate->do_optimization && !RedoRecPtrChanged(wstate->redo))
+		return;
+
 	smgrimmedsync(smgrrel, forknum);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4b45ac64db..71fe99a28d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -241,6 +241,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
index db5e3b1cac..e5454a3296 100644
--- a/src/include/storage/directmgr.h
+++ b/src/include/storage/directmgr.h
@@ -14,6 +14,7 @@
 #ifndef DIRECTMGR_H
 #define DIRECTMGR_H
 
+#include "access/xlogdefs.h"
 #include "common/relpath.h"
 #include "storage/block.h"
 #include "storage/bufpage.h"
@@ -28,14 +29,22 @@ typedef struct UnBufferedWriteState
 	 * must fsync the pages they have written themselves. This is necessary
 	 * only if the relation is WAL-logged or in special cases such as the init
 	 * fork of an unlogged index.
+	 *
+	 * These callers can optionally use the following optimization: attempt to
+	 * use the sync request queue and fall back to fsync'ing the pages
+	 * themselves only if the Redo pointer moves between the start and finish
+	 * of the write. To do this, they must pass do_optimization = true so that
+	 * the Redo pointer is saved before the write begins.
 	 */
 	bool do_wal;
+	bool do_optimization;
+	XLogRecPtr redo;
 } UnBufferedWriteState;
 /*
  * prototypes for functions in directmgr.c
  */
 extern void
-unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool do_optimization);
 extern void
 unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 		ForkNumber forknum, BlockNumber blocknum, Page page, bool empty);
-- 
2.30.2
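
To summarize the resulting unbuffered_finish() behavior for callers of the
new parameters (derived from the directmgr code above, since this patch
adds no users):

    do_wal   do_optimization   unbuffered_finish()
    ------   ---------------   --------------------------------------------
    false    false             returns without fsync'ing (not WAL-logged)
    true     false             smgrimmedsync() of the fork
    true     true              smgrimmedsync() only if RedoRecPtrChanged();
                               otherwise the queued request is enough

The btree patch earlier in this mail removes a call that passed
wstate->btws_use_wal for both parameters, which appears to be how the
optimization gets turned on for the btree build elsewhere in the series,
e.g.:

    unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal,
                    wstate->btws_use_wal);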
