Hi,

On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
> From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplage...@gmail.com>
> Date: Thu, 15 Apr 2021 07:01:01 -0400
> Subject: [PATCH v2] Index build avoids immed fsync
>
> Avoid immediate fsync for just built indexes either by using shared
> buffers or by leveraging checkpointer's SyncRequest queue. When a
> checkpoint begins during the index build, if not using shared buffers,
> the backend will have to do its own fsync.
> ---
>  src/backend/access/nbtree/nbtree.c  |  39 +++---
>  src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
>  src/backend/access/transam/xlog.c   |  14 +++
>  src/include/access/xlog.h           |   1 +
>  4 files changed, 188 insertions(+), 52 deletions(-)
>
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 40ad0956e0..a2e32f18e6 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -150,30 +150,29 @@ void
>  btbuildempty(Relation index)
>  {
>  	Page		metapage;
> +	Buffer		metabuf;
>
> -	/* Construct metapage. */
> -	metapage = (Page) palloc(BLCKSZ);
> -	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
> -
> +	// TODO: test this.
Shouldn't this path have plenty of coverage?

>  	/*
> -	 * Write the page and log it.  It might seem that an immediate sync would
> -	 * be sufficient to guarantee that the file exists on disk, but recovery
> -	 * itself might remove it while replaying, for example, an
> -	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> -	 * this even when wal_level=minimal.
> +	 * Construct metapage.
> +	 * Because we don't need to lock the relation for extension (since
> +	 * no one knows about it yet) and we don't need to initialize the
> +	 * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
> +	 * (with P_NEW and BT_WRITE) is overkill.

Isn't the more relevant operation the log_newpage_buffer()?

> However, it might be worth
> +	 * either modifying it or adding a new helper function instead of
> +	 * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
> +	 * because we want to hold an exclusive lock on the buffer content
> +	 */

"modifying it" refers to what? I don't see a problem using
ReadBufferExtended() here. Why would you like to replace it with
something else?

> +	/*
> +	 * Based on the number of tuples, either create a buffered or unbuffered
> +	 * write state. If the number of tuples is small, make a buffered write;
> +	 * if the number of tuples is larger, then we make an unbuffered write
> +	 * state and must ensure that we check the redo pointer to know whether
> +	 * or not we need to fsync ourselves.
> +	 */
>
>  	/*
>  	 * Finish the build by (1) completing the sort of the spool file, (2)
>  	 * inserting the sorted tuples into btree pages and (3) building the upper
>  	 * levels. Finally, it may also be necessary to end use of parallelism.
>  	 */
> -	_bt_leafbuild(buildstate.spool, buildstate.spool2);
> +	if (reltuples > 1000)

I'm ok with some random magic constant here, but it seems worth putting
it in some constant / #define to make it more obvious.
> +		_bt_leafbuild(buildstate.spool, buildstate.spool2, false);
> +	else
> +		_bt_leafbuild(buildstate.spool, buildstate.spool2, true);

Why duplicate the function call?

>  /*
>   * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * If index is not built in shared buffers, buf should be InvalidBuffer
>   */
>  static Page
> -_bt_blnewpage(uint32 level)
> +_bt_blnewpage(uint32 level, Buffer buf)
>  {
>  	Page		page;
>  	BTPageOpaque opaque;
>
> -	page = (Page) palloc(BLCKSZ);
> +	if (buf)
> +		page = BufferGetPage(buf);
> +	else
> +		page = (Page) palloc(BLCKSZ);
>
>  	/* Zero the page and set up standard page header info */
>  	_bt_pageinit(page, BLCKSZ);

Ick, that seems pretty ugly API-wise and consequently too likely to lead
to pfree()ing a page that's actually in shared buffers. I think it'd
make sense to separate the allocation from the initialization bits?

> @@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
>   * emit a completed btree page, and release the working storage.
>   */
>  static void
> -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> +_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
>  {
> +	if (wstate->use_shared_buffers)
> +	{
> +		Assert(buf);
> +		START_CRIT_SECTION();
> +		MarkBufferDirty(buf);
> +		if (wstate->btws_use_wal)
> +			log_newpage_buffer(buf, true);
> +		END_CRIT_SECTION();
> +		_bt_relbuf(wstate->index, buf);
> +		return;
> +	}
> +
>  	/* XLOG stuff */
>  	if (wstate->btws_use_wal)
>  	{
> @@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>  		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
>  				   wstate->btws_pages_written++,
>  				   (char *) wstate->btws_zeropage,
> -				   true);
> +				   false);
>  }

Is there a good place to document the way we ensure durability for this
path?
> +	/*
> +	 * Extend the index relation upfront to reserve the metapage
> +	 */
> +	if (wstate->use_shared_buffers)
> +	{
> +		/*
> +		 * We should not need to LockRelationForExtension() as no one else
> +		 * knows about this index yet?
> +		 * Extend the index relation by one block for the metapage. _bt_getbuf()
> +		 * is not used here as it does _bt_pageinit(), which is done later by
> +		 * _bt_initmetapage(). We will fill in the metapage and write it out at
> +		 * the end of index build when we have all of the information required
> +		 * for the metapage. However, we initially extend the relation for it to
> +		 * occupy block 0 because it is much easier when using shared buffers to
> +		 * extend the relation with a block number that is always increasing by
> +		 * 1.

Not quite following what you're trying to get at here. There generally
is no way to extend a relation except by increasing block numbers?

> @@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
>  	 * still not be on disk when the crash occurs.
>  	 */
>  	if (wstate->btws_use_wal)
> -		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +	{
> +		if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
> +			smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +	}
>  }

This needs documentation. The whole comment above isn't accurate anymore
afaict?
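Restating my reading of the commit message, a replacement comment for
this spot could start from something like the following (not verified
against what the patch actually guarantees):

```
if (wstate->btws_use_wal)
{
    /*
     * Shared-buffers case: the pages were dirtied in shared buffers, so
     * checkpoints write and fsync them for us; nothing to do here.
     *
     * Private-write case: a sync request was forwarded to the
     * checkpointer for each page we wrote.  That is sufficient unless a
     * checkpoint began after the build started (i.e. the redo pointer
     * moved), in which case those requests may not be covered by the
     * checkpoint and we must fsync ourselves.
     */
    if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
        smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
}
```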
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 1616448368..63fd212787 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
>  	return RedoRecPtr;
>  }
>
> +bool
> +RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{
> +	XLogRecPtr	ptr;
> +
> +	SpinLockAcquire(&XLogCtl->info_lck);
> +	ptr = XLogCtl->RedoRecPtr;
> +	SpinLockRelease(&XLogCtl->info_lck);
> +
> +	if (RedoRecPtr < ptr)
> +		RedoRecPtr = ptr;
> +	return RedoRecPtr != comparator_ptr;
> +}

What's the deal with the < comparison?

Greetings,

Andres Freund