Hi,

On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
> From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplage...@gmail.com>
> Date: Thu, 15 Apr 2021 07:01:01 -0400
> Subject: [PATCH v2] Index build avoids immed fsync
>
> Avoid immediate fsync for just built indexes either by using shared
> buffers or by leveraging checkpointer's SyncRequest queue. When a
> checkpoint begins during the index build, if not using shared buffers,
> the backend will have to do its own fsync.
> ---
>  src/backend/access/nbtree/nbtree.c  |  39 +++---
>  src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
>  src/backend/access/transam/xlog.c   |  14 +++
>  src/include/access/xlog.h           |   1 +
>  4 files changed, 188 insertions(+), 52 deletions(-)
>
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 40ad0956e0..a2e32f18e6 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -150,30 +150,29 @@ void
>  btbuildempty(Relation index)
>  {
>  	Page		metapage;
> +	Buffer		metabuf;
>
> -	/* Construct metapage. */
> -	metapage = (Page) palloc(BLCKSZ);
> -	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
> -
> +	// TODO: test this.
Shouldn't this path have plenty of coverage?

>  	/*
> -	 * Write the page and log it.  It might seem that an immediate sync would
> -	 * be sufficient to guarantee that the file exists on disk, but recovery
> -	 * itself might remove it while replaying, for example, an
> -	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> -	 * this even when wal_level=minimal.
> +	 * Construct metapage.
> +	 * Because we don't need to lock the relation for extension (since
> +	 * no one knows about it yet) and we don't need to initialize the
> +	 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
> +	 * (with P_NEW and BT_WRITE) is overkill.

Isn't the more relevant operation the log_newpage_buffer()?

> However, it might be worth
> +	 * either modifying it or adding a new helper function instead of
> +	 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
> +	 * because we want to hold an exclusive lock on the buffer content
> +	 */

"modifying it" refers to what? I don't see a problem using
ReadBufferExtended() here. Why would you like to replace it with
something else?

> +	/*
> +	 * Based on the number of tuples, either create a buffered or unbuffered
> +	 * write state. If the number of tuples is small, make a buffered write;
> +	 * if the number of tuples is larger, then we make an unbuffered write
> +	 * state and must ensure that we check the redo pointer to know whether
> +	 * or not we need to fsync ourselves.
> +	 */
>
>  	/*
>  	 * Finish the build by (1) completing the sort of the spool file, (2)
>  	 * inserting the sorted tuples into btree pages and (3) building the upper
>  	 * levels. Finally, it may also be necessary to end use of parallelism.
>  	 */
> -	_bt_leafbuild(buildstate.spool, buildstate.spool2);
> +	if (reltuples > 1000)

I'm ok with some random magic constant here, but it seems worth putting
it in some constant / #define to make it more obvious.
> +		_bt_leafbuild(buildstate.spool, buildstate.spool2, false);
> +	else
> +		_bt_leafbuild(buildstate.spool, buildstate.spool2, true);

Why duplicate the function call?

>  /*
>   * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * If index is not built in shared buffers, buf should be InvalidBuffer
>   */
>  static Page
> -_bt_blnewpage(uint32 level)
> +_bt_blnewpage(uint32 level, Buffer buf)
>  {
>  	Page		page;
>  	BTPageOpaque opaque;
>
> -	page = (Page) palloc(BLCKSZ);
> +	if (buf)
> +		page = BufferGetPage(buf);
> +	else
> +		page = (Page) palloc(BLCKSZ);
>
>  	/* Zero the page and set up standard page header info */
>  	_bt_pageinit(page, BLCKSZ);

Ick, that seems pretty ugly API-wise and consequently too likely to lead
to pfree()ing a page that's actually in shared buffers. I think it'd
make sense to separate the allocation from the initialization bits?

> @@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
>   * emit a completed btree page, and release the working storage.
>   */
>  static void
> -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> +_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
>  {
> +	if (wstate->use_shared_buffers)
> +	{
> +		Assert(buf);
> +		START_CRIT_SECTION();
> +		MarkBufferDirty(buf);
> +		if (wstate->btws_use_wal)
> +			log_newpage_buffer(buf, true);
> +		END_CRIT_SECTION();
> +		_bt_relbuf(wstate->index, buf);
> +		return;
> +	}
> +
>  	/* XLOG stuff */
>  	if (wstate->btws_use_wal)
>  	{
> @@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>  		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
>  				   wstate->btws_pages_written++,
>  				   (char *) wstate->btws_zeropage,
> -				   true);
> +				   false);
>  }

Is there a good place to document the way we ensure durability for this
path?
> +	/*
> +	 * Extend the index relation upfront to reserve the metapage
> +	 */
> +	if (wstate->use_shared_buffers)
> +	{
> +		/*
> +		 * We should not need to LockRelationForExtension() as no one else
> +		 * knows about this index yet?
> +		 * Extend the index relation by one block for the metapage. _bt_getbuf()
> +		 * is not used here as it does _bt_pageinit(), which is done later by
> +		 * _bt_initmetapage(). We will fill in the metapage and write it out at
> +		 * the end of index build when we have all of the information required
> +		 * for the metapage. However, we initially extend the relation for it to
> +		 * occupy block 0 because it is much easier when using shared buffers to
> +		 * extend the relation with a block number that is always increasing by
> +		 * 1.

Not quite following what you're trying to get at here. There generally
is no way to extend a relation except by increasing block numbers?

> @@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
>  	 * still not be on disk when the crash occurs.
>  	 */
>  	if (wstate->btws_use_wal)
> -		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +	{
> +		if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
> +			smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +	}
>  }

This needs documentation. The whole comment above isn't accurate anymore
afaict?
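Restating my reading of the commit message, a replacement comment for
this spot could start from something like the following (not verified
against what the patch actually guarantees):

```
if (wstate->btws_use_wal)
{
    /*
     * Shared-buffers case: the pages were dirtied in shared buffers, so
     * checkpoints write and fsync them for us; nothing to do here.
     *
     * Private-write case: a sync request was forwarded to the
     * checkpointer for each page we wrote.  That is sufficient unless a
     * checkpoint began after the build started (i.e. the redo pointer
     * moved), in which case those requests may not be covered by the
     * checkpoint and we must fsync ourselves.
     */
    if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
        smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
}
```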
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 1616448368..63fd212787 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
>  	return RedoRecPtr;
>  }
>
> +bool
> +RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{
> +	XLogRecPtr	ptr;
> +
> +	SpinLockAcquire(&XLogCtl->info_lck);
> +	ptr = XLogCtl->RedoRecPtr;
> +	SpinLockRelease(&XLogCtl->info_lck);
> +
> +	if (RedoRecPtr < ptr)
> +		RedoRecPtr = ptr;
> +	return RedoRecPtr != comparator_ptr;
> +}

What's the deal with the < comparison?

Greetings,

Andres Freund