Hi,

> From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplage...@gmail.com>
> Date: Fri, 4 Mar 2022 14:48:39 -0500
> Subject: [PATCH v6 1/4] Add unbuffered IO API

I think this or one of the following patches should update 
src/backend/access/transam/README


> @@ -164,6 +164,16 @@ void
>  blbuildempty(Relation index)
>  {
>       Page            metapage;
> +     UnBufferedWriteState wstate;
> +
> +     /*
> +      * Though this is an unlogged relation, pass do_wal=true since the init
> +      * fork of an unlogged index must be wal-logged and fsync'd. This 
> currently
> +      * has no effect, as unbuffered_write() does not do log_newpage()
> +      * internally. However, were this to be replaced with 
> unbuffered_extend(),
> +      * do_wal must be true to ensure the data is logged and fsync'd.
> +      */
> +     unbuffered_prep(&wstate, true);

Wonder if unbuffered_write should have an assert checking that no writes to
INIT_FORKNUM are non-durable? Looks like it's pretty easy to forget...
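
Something like this, maybe (untested sketch; assumes do_wal stays the only
durability-related field in UnBufferedWriteState):

void
unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
		ForkNumber forknum, BlockNumber blocknum, Page page)
{
	/* init forks must always be WAL-logged and fsync'd, even for unlogged rels */
	Assert(forknum != INIT_FORKNUM || wstate->do_wal);

	PageSetChecksumInplace(page, blocknum);

	smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
}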

I'd choose unbuffered_begin over _prep().


>       /* Construct metapage. */
>       metapage = (Page) palloc(BLCKSZ);
> @@ -176,18 +186,13 @@ blbuildempty(Relation index)
>        * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
>        * this even when wal_level=minimal.
>        */
> -     PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
> -     smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> -                       (char *) metapage, true);
> +     unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> +                     BLOOM_METAPAGE_BLKNO, metapage);
> +
>       log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
>                               BLOOM_METAPAGE_BLKNO, metapage, true);
>  
> -     /*
> -      * An immediate sync is required even if we xlog'd the page, because the
> -      * write did not go through shared_buffers and therefore a concurrent
> -      * checkpoint may have moved the redo pointer past our xlog record.
> -      */
> -     smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
> +     unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
>  }

I mildly prefer complete over finish, but ...



> - * We can't use the normal heap_insert function to insert into the new
> - * heap, because heap_insert overwrites the visibility information.
> - * We use a special-purpose raw_heap_insert function instead, which
> - * is optimized for bulk inserting a lot of tuples, knowing that we have
> - * exclusive access to the heap.  raw_heap_insert builds new pages in
> - * local storage.  When a page is full, or at the end of the process,
> - * we insert it to WAL as a single record and then write it to disk
> - * directly through smgr.  Note, however, that any data sent to the new
> - * heap's TOAST table will go through the normal bufmgr.
> + * We can't use the normal heap_insert function to insert into the new heap,
> + * because heap_insert overwrites the visibility information. We use a
> + * special-purpose raw_heap_insert function instead, which is optimized for
> + * bulk inserting a lot of tuples, knowing that we have exclusive access to 
> the
> + * heap.  raw_heap_insert builds new pages in local storage.  When a page is
> + * full, or at the end of the process, we insert it to WAL as a single record
> + * and then write it to disk directly through directmgr.  Note, however, that
> + * any data sent to the new heap's TOAST table will go through the normal
> + * bufmgr.

Part of this just reflows existing lines that seem otherwise unchanged, making
it harder to see the actual change.



> @@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level)
>  static void
>  _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>  {
> -     /* XLOG stuff */
> -     if (wstate->btws_use_wal)
> -     {
> -             /* We use the XLOG_FPI record type for this */
> -             log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, 
> true);
> -     }
> -
>       /*
>        * If we have to write pages nonsequentially, fill in the space with
>        * zeroes until we come back and overwrite.  This is not logically
> @@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, 
> BlockNumber blkno)
>       {
>               if (!wstate->btws_zeropage)
>                       wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> -             /* don't set checksum for all-zero page */
> -             smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -                                wstate->btws_pages_written++,
> -                                (char *) wstate->btws_zeropage,
> -                                true);
> +
> +             unbuffered_extend(&wstate->ub_wstate, 
> RelationGetSmgr(wstate->index),
> +                             MAIN_FORKNUM, wstate->btws_pages_written++,
> +                             wstate->btws_zeropage, true);
>       }

For a bit I thought the true argument to unbuffered_extend was about
durability or registering fsyncs. Perhaps worth making it a flags argument,
with an enum for the individual flag values?
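
Rough illustration of what I mean (UNBUFFERED_FLAG_EMPTY_PAGE is a name I just
made up; the body is otherwise what the patch already does):

typedef enum UnBufferedWriteFlags
{
	UNBUFFERED_FLAG_EMPTY_PAGE = 1 << 0	/* skip checksum and WAL */
} UnBufferedWriteFlags;

void
unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
		ForkNumber forknum, BlockNumber blocknum, Page page, int flags)
{
	bool	empty = (flags & UNBUFFERED_FLAG_EMPTY_PAGE) != 0;

	/* don't checksum or WAL-log empty pages */
	if (!empty)
		PageSetChecksumInplace(page, blocknum);

	smgrextend(smgrrel, forknum, blocknum, (char *) page, true);

	if (!empty && wstate->do_wal)
		log_newpage(&smgrrel->smgr_rnode.node, forknum, blocknum, page, true);
}

so the call site above would read

	unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
			MAIN_FORKNUM, wstate->btws_pages_written++,
			wstate->btws_zeropage, UNBUFFERED_FLAG_EMPTY_PAGE);

which is self-documenting in a way a bare true isn't.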


> diff --git a/src/backend/storage/direct/directmgr.c 
> b/src/backend/storage/direct/directmgr.c
> new file mode 100644
> index 0000000000..42c37daa7a
> --- /dev/null
> +++ b/src/backend/storage/direct/directmgr.c
> @@ -0,0 +1,98 @@

Now that the API is called unbuffered, the filename / directory seem
confusing.


> +void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +{
> +     wstate->do_wal = do_wal;
> +}
> +
> +void
> +unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation
> +             smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, 
> bool
> +             empty)
> +{
> +     /*
> +      * Don't checksum empty pages
> +      */
> +     if (!empty)
> +             PageSetChecksumInplace(page, blocknum);
> +
> +     smgrextend(smgrrel, forknum, blocknum, (char *) page, true);
> +
> +     /*
> +      * Don't WAL-log empty pages
> +      */
> +     if (!empty && wstate->do_wal)
> +             log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
> +                                     blocknum, page, true);
> +}
> +
> +void
> +unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
> +             ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page 
> *pages,
> +             bool empty, XLogRecPtr custom_lsn)
> +{
> +     for (int i = 0; i < num_pages; i++)
> +     {
> +             Page            page = pages[i];
> +             BlockNumber blkno = blocknums[i];
> +
> +             if (!XLogRecPtrIsInvalid(custom_lsn))
> +                     PageSetLSN(page, custom_lsn);
> +
> +             if (!empty)
> +                     PageSetChecksumInplace(page, blkno);
> +
> +             smgrextend(smgrrel, forknum, blkno, (char *) page, true);
> +     }
> +
> +     if (!empty && wstate->do_wal)
> +             log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages,
> +                             blocknums, pages, true);
> +}
> +
> +void
> +unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, 
> ForkNumber
> +             forknum, BlockNumber blocknum, Page page)
> +{
> +     PageSetChecksumInplace(page, blocknum);
> +
> +     smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
> +}

Seems like several of these should have some minimal documentation?
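
Even something minimal along these lines (wording is mine, just to illustrate)
would help:

/*
 * unbuffered_write
 *
 * Write out a single page to the given block of a relation fork, bypassing
 * shared buffers.  The caller is responsible for WAL-logging the page and
 * must call unbuffered_finish() afterwards to guarantee the data reaches
 * disk.
 */
void
unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
		ForkNumber forknum, BlockNumber blocknum, Page page)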


> +/*
> + * When writing data outside shared buffers, a concurrent CHECKPOINT can move
> + * the redo pointer past our WAL entries and won't flush our data to disk. If
> + * the database crashes before the data makes it to disk, our WAL won't be
> + * replayed and the data will be lost.
> + * Thus, if a CHECKPOINT begins between unbuffered_prep() and
> + * unbuffered_finish(), the backend must fsync the data itself.
> + */

Hm. The last sentence sounds like this happens conditionally, but it doesn't
at this point.



> +typedef struct UnBufferedWriteState
> +{
> +     /*
> +      * When writing WAL-logged relation data outside of shared buffers, 
> there
> +      * is a risk of a concurrent CHECKPOINT moving the redo pointer past the
> +      * data's associated WAL entries. To avoid this, callers in this 
> situation
> +      * must fsync the pages they have written themselves. This is necessary
> +      * only if the relation is WAL-logged or in special cases such as the 
> init
> +      * fork of an unlogged index.
> +      */
> +     bool do_wal;
> +} UnBufferedWriteState;
> +/*
> + * prototypes for functions in directmgr.c
> + */

There's a missing newline between the end of the struct and the comment.

> +extern void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);

In headers we don't put the return type on a separate line :/
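
I.e. the tail of the header would look something like this (the long comment
inside the struct elided):

typedef struct UnBufferedWriteState
{
	bool		do_wal;
} UnBufferedWriteState;

/*
 * prototypes for functions in directmgr.c
 */
extern void unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);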





> --- a/contrib/bloom/blinsert.c
> +++ b/contrib/bloom/blinsert.c
> @@ -173,7 +173,7 @@ blbuildempty(Relation index)
>        * internally. However, were this to be replaced with 
> unbuffered_extend(),
>        * do_wal must be true to ensure the data is logged and fsync'd.
>        */
> -     unbuffered_prep(&wstate, true);
> +     unbuffered_prep(&wstate, true, false);

This makes me think this really should be a flag argument...

I also don't like the current name of the parameter; "do_optimization" doesn't
explain much.
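
To make call sites like the one above readable, something along the lines of
(flag name invented here, with the absent fsync-optimization flag standing in
for the current false):

	unbuffered_prep(&wstate, UNBUFFERED_WAL_LOG);

would say a lot more than a bare (true, false).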


> +bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{

Newline needed after the return type.
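
I.e.:

bool
RedoRecPtrChanged(XLogRecPtr comparator_ptr)
{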

>  void
> -unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool
> +             do_optimization)

See earlier comments about documentation and parameter naming...


> +      * These callers can optionally use the following optimization: attempt 
> to
> +      * use the sync request queue and fall back to fsync'ing the pages
> +      * themselves if the Redo pointer moves between the start and finish of
> +      * their write. In order to do this, they must set do_optimization to 
> true
> +      * so that the redo pointer is saved before the write begins.
>        */

When do we not want this?



> From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplage...@gmail.com>
> Date: Fri, 4 Mar 2022 15:53:05 -0500
> Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization
> 
> While building a btree index, the backend can avoid fsync'ing all of the
> pages if it uses the optimization introduced in a prior commit.
> 
> This can substantially improve performance when many indexes are being
> built during DDL operations.
> ---
>  src/backend/access/nbtree/nbtree.c  | 2 +-
>  src/backend/access/nbtree/nbtsort.c | 6 +++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c 
> b/src/backend/access/nbtree/nbtree.c
> index 6b78acefbe..fc5cce4603 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -161,7 +161,7 @@ btbuildempty(Relation index)
>        * internally. However, were this to be replaced with 
> unbuffered_extend(),
>        * do_wal must be true to ensure the data is logged and fsync'd.
>        */
> -     unbuffered_prep(&wstate, true, false);
> +     unbuffered_prep(&wstate, true, true);
>  
>       /* Construct metapage. */
>       metapage = (Page) palloc(BLCKSZ);
> diff --git a/src/backend/access/nbtree/nbtsort.c 
> b/src/backend/access/nbtree/nbtsort.c
> index d6d0d4b361..f1b9e2e24e 100644
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, 
> BTSpool *btspool2)
>       int64           tuples_done = 0;
>       bool            deduplicate;
>  
> -     unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
> +     /*
> +      * The fsync optimization done by directmgr is only relevant if
> +      * WAL-logging, so pass btws_use_wal for this parameter.
> +      */
> +     unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, 
> wstate->btws_use_wal);
>  
>       deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
>               BTGetDeduplicateItems(wstate->index);

Why just here?



> From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplage...@gmail.com>
> Date: Fri, 4 Mar 2022 15:52:45 -0500
> Subject: [PATCH v6 4/4] Use shared buffers when possible for index build
> 
> When there are not too many tuples, building the index in shared buffers
> makes sense. It allows the buffer manager to handle how best to do the
> IO.
> ---

Perhaps it'd be worth making this an independent patch that could be applied
separately?


>  src/backend/access/nbtree/nbtree.c  |  32 ++--
>  src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++-------
>  2 files changed, 223 insertions(+), 82 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c 
> b/src/backend/access/nbtree/nbtree.c
> index fc5cce4603..d3982b9388 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -152,34 +152,24 @@ void
>  btbuildempty(Relation index)
>  {
>       Page            metapage;
> -     UnBufferedWriteState wstate;
> +     Buffer metabuf;
>  
>       /*
> -      * Though this is an unlogged relation, pass do_wal=true since the init
> -      * fork of an unlogged index must be wal-logged and fsync'd. This 
> currently
> -      * has no effect, as unbuffered_write() does not do log_newpage()
> -      * internally. However, were this to be replaced with 
> unbuffered_extend(),
> -      * do_wal must be true to ensure the data is logged and fsync'd.
> +      * Allocate a buffer for metapage and initialize metapage.
>        */
> -     unbuffered_prep(&wstate, true, true);
> -
> -     /* Construct metapage. */
> -     metapage = (Page) palloc(BLCKSZ);
> +     metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, 
> RBM_ZERO_AND_LOCK,
> +                     NULL);
> +     metapage = BufferGetPage(metabuf);
>       _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
>  
>       /*
> -      * Write the page and log it.  It might seem that an immediate sync 
> would
> -      * be sufficient to guarantee that the file exists on disk, but recovery
> -      * itself might remove it while replaying, for example, an
> -      * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> -      * this even when wal_level=minimal.
> +      * Mark metapage buffer as dirty and XLOG it
>        */
> -     unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> -                     BTREE_METAPAGE, metapage);
> -     log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
> -                             BTREE_METAPAGE, metapage, true);
> -
> -     unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
> +     START_CRIT_SECTION();
> +     MarkBufferDirty(metabuf);
> +     log_newpage_buffer(metabuf, true);
> +     END_CRIT_SECTION();
> +     _bt_relbuf(index, metabuf);
>  }

I don't understand why this patch changes btbuildempty()? That data is never
accessed again, so it doesn't really seem beneficial to shuffle it through
shared buffers, even if we benefit from using s_b for the main fork...



> +     /*
> +      * If not using shared buffers, for a WAL-logged relation, save the redo
> +      * pointer location in case a checkpoint begins during the index build.
> +      */
> +     if (wstate->_bt_bl_unbuffered_prep)
> +             wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
> +                             wstate->btws_use_wal, wstate->btws_use_wal);

Code would probably look cleaner if an empty callback were used when no
_bt_bl_unbuffered_prep callback is needed.
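
I.e. something like this (sketch; the signature is taken from how the callback
is invoked in this patch):

static void
_bt_bl_unbuffered_prep_noop(UnBufferedWriteState *wstate, bool do_wal,
		bool do_optimization)
{
	/* no extra prep needed when building in shared buffers */
}

so the call site can just do

	wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
			wstate->btws_use_wal, wstate->btws_use_wal);

unconditionally, without testing the pointer first.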



>  /*
> - * allocate workspace for a new, clean btree page, not linked to any 
> siblings.
> + * Set up workspace for a new, clean btree page, not linked to any siblings.
> + * Caller must allocate the passed in page.

The more interesting bit seems to be whether the passed-in page needs to be
initialized?


> @@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState 
> *state)
>                * back one slot.  Then we can dump out the page.
>                */
>               _bt_slideleft(s->btps_page);
> -             _bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
> +             wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, 
> s->btps_buf);
> +             s->btps_buf = InvalidBuffer;
>               s->btps_page = NULL;    /* writepage freed the workspace */
>       }

Do we really have to add leading underscores even to struct member names? That
just looks wrong.


Greetings,

Andres Freund

