So, I've written a patch which avoids the immediate fsync for index builds, either by using shared buffers or by queueing sync requests with the checkpointer. If a checkpoint starts during the index build and the backend is not using shared buffers for the build, it still has to do the fsync itself.
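In outline, the per-page write path looks like this (condensed from _bt_blwritepage() in the attached patch; the zero-fill of skipped blocks, the extend-vs-overwrite distinction, and error handling are omitted here):

/*
 * Condensed from _bt_blwritepage() in the attached v1 patch -- not the full
 * function.
 */
static void
_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
{
    if (wstate->use_shared_buffers)
    {
        /*
         * Shared-buffers path: dirty the buffer and WAL-log it; writing and
         * syncing the block is left to the usual buffer-manager machinery,
         * so this backend never needs an immediate fsync.
         */
        START_CRIT_SECTION();
        MarkBufferDirty(buf);
        if (wstate->btws_use_wal)
            log_newpage_buffer(buf, true);
        END_CRIT_SECTION();
        _bt_relbuf(wstate->index, buf);
        return;
    }

    /*
     * Unbuffered path: write through smgr with skipFsync = false, which
     * queues a sync request with the checkpointer instead of fsyncing here.
     */
    RelationOpenSmgr(wstate->index);
    if (wstate->btws_use_wal)
        log_newpage(&wstate->index->rd_smgr->smgr_rnode.node, MAIN_FORKNUM,
                    blkno, page, true);
    PageSetChecksumInplace(page, blkno);
    smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno, (char *) page, false);
    wstate->btws_pages_written++;
    pfree(page);
}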
The reviewer will notice that _bt_load() extends the index relation for the metapage before beginning the actual load of leaf pages but does not actually write the metapage until the end of the index build. When using shared buffers, it was difficult to create block 0 of the index after creating all of the other blocks, as the block number is assigned inside ReadBuffer_common(), and the current bufmgr API has no good way to extend a relation with a caller-specified block number. I am not entirely sure of the correctness of doing an smgrextend() (when not using shared buffers) without writing any WAL. However, the metapage contents are not written until after they are WAL-logged later in _bt_blwritepage(), so perhaps it is okay?

I am also not fond of the change to the signature of _bt_uppershutdown() that this implementation forces. I now have to reserve the metapage before doing the index build -- pinning and locking a shared buffer for it when using shared buffers, or palloc'ing a page for it otherwise -- and then pass that buffer or page to _bt_uppershutdown() once the rest of the index build is done. I don't know that it is incorrect -- more that it feels a bit messy (and inefficient) to hold onto that shared buffer or memory for the duration of the index build, during which I have no intention of doing anything with it. The alternative I devised, though, was to change ReadBuffer_common(), or to add a new ReadBufferExtended() mode in which the caller specifies the block number and whether or not it is an extend, and that didn't seem right either.

For the extensions of the index done during the index build, I use ReadBufferExtended() directly instead of _bt_getbuf() for a few reasons. I thought (but am not sure) that I don't need to do LockRelationForExtension() during the index build. Also, I decided to use RBM_ZERO_AND_LOCK mode so that I hold an exclusive lock on the buffer content instead of doing _bt_lockbuf() (which is what _bt_getbuf() does). And, in most of the places where I added the call to ReadBufferExtended(), the non-shared-buffers code path already initializes the page, so it made more sense to just share that code path. I considered whether it made sense to add a new btree utility function which calls ReadBufferExtended() in this way, but I wasn't sure how much that would buy me. The other place it might be usable is btvacuumpage(), but that case is different enough that I'm not even sure what the function would be called -- basically, it would just be an alternative to _bt_getbuf() for a couple of somewhat unrelated edge cases.

On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > On 21/01/2021 22:36, Andres Freund wrote:
> > > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > > that the attached script takes 2m17.223s with the smgrimmedsync and
> > > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > > the former case, CPU bound in the latter.
> > >
> > > Creating lots of tables with indexes (directly or indirectly through
> > > relations having a toast table) is pretty common, particularly after the
> > > introduction of partitioning.
Moving builds of indexes which would fit in shared buffers back into shared buffers has the benefit of eliminating the need to write them out and fsync them if they will be subsequently used and thus read right back into shared buffers. This avoids some of the unnecessary fsyncs Andres is talking about here, as well as some of the extra IO required to write the pages and then read them back in. I have a dummy criterion for whether or not to use shared buffers (in the attached patch, shared buffers are used when the number of tuples to be indexed is at most 1000). I am considering using a threshold of some percentage of the size of shared buffers as the actual criterion for determining where to do the index build.

> > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > requests, the potential problems that I can see are:
> > >
> > > 1) redo point falls between the log_newpage() and the
> > > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > 2) redo point falls between write() and register_dirty_segment()
> > >
> > > But both of these are fine in the context of initially filling a newly
> > > created relfilenode, as far as I can tell? Otherwise the current
> > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> >
> > Hmm. If the redo point falls between write() and the
> > register_dirty_segment(), and the checkpointer finishes the whole
> > checkpoint before register_dirty_segment(), you are not safe. That can't
> > happen with write from the buffer manager, because the checkpointer would
> > block waiting for the flush of the buffer to finish.
>
> Hm, right.
>
> The easiest way to address that race would be to just record the redo
> pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
> checkpoint started since the start of the index build.
>
> Another approach would be to utilize PGPROC.delayChkpt, but I would
> rather not unnecessarily expand the use of that.
>
> It's kind of interesting - in my aio branch I moved the
> register_dirty_segment() to before the actual asynchronous write (due to
> availability of the necessary data), which ought to be safe because of
> the buffer interlocking. But that doesn't work here, or for other places
> doing writes without going through s_b. It'd be great if we could come
> up with a general solution, but I don't immediately see anything great.
>
> The best I can come up with is adding helper functions to wrap some of
> the complexity for "unbuffered" writes of doing an immedsync iff the
> redo pointer changed. Something very roughly like
>
> typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
> void unbuffered_prep(UnbufferedWriteState* state);
> void unbuffered_write(UnbufferedWriteState* state, ...);
> void unbuffered_extend(UnbufferedWriteState* state, ...);
> void unbuffered_finish(UnbufferedWriteState* state);
>
> which wouldn't just do the dance to avoid the immedsync() if possible,
> but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
> if we get that [1]).

Regarding the implementation, I think having an API to do these "unbuffered" or "direct" writes outside of shared buffers is a good idea. In this specific case, the proposed API would not change the code much. I would just wrap the small diffs I added to the beginning and end of _bt_load() in calls to unbuffered_prep() and unbuffered_finish() and then tuck away the second half of _bt_blwritepage() in unbuffered_write()/unbuffered_extend().
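As a rough sketch only -- this is not part of the attached patch, the parameter lists are hypothetical (Andres left them as "..."), the bodies are just a guess at the minimum needed for the redo-pointer dance, and WAL logging and error handling are left out -- such helpers might look like the following. RedoRecPtrChanged() is the helper added in the attached patch; per Andres's note, the write/extend wrappers also take care of PageSetChecksumInplace().

typedef struct UnbufferedWriteState
{
    XLogRecPtr  redo;       /* redo pointer cached by unbuffered_prep() */
    uint64      numwrites;  /* number of writes/extends done so far */
} UnbufferedWriteState;

static void
unbuffered_prep(UnbufferedWriteState *wstate)
{
    wstate->redo = GetRedoRecPtr();
    wstate->numwrites = 0;
}

static void
unbuffered_extend(UnbufferedWriteState *wstate, SMgrRelation smgrrel,
                  ForkNumber forknum, BlockNumber blocknum, Page page)
{
    PageSetChecksumInplace(page, blocknum);
    /* skipFsync = false queues a sync request with the checkpointer */
    smgrextend(smgrrel, forknum, blocknum, (char *) page, false);
    wstate->numwrites++;
}

static void
unbuffered_write(UnbufferedWriteState *wstate, SMgrRelation smgrrel,
                 ForkNumber forknum, BlockNumber blocknum, Page page)
{
    PageSetChecksumInplace(page, blocknum);
    smgrwrite(smgrrel, forknum, blocknum, (char *) page, false);
    wstate->numwrites++;
}

static void
unbuffered_finish(UnbufferedWriteState *wstate, SMgrRelation smgrrel,
                  ForkNumber forknum)
{
    /*
     * Only if a checkpoint began after unbuffered_prep() could the
     * checkpointer have completed without seeing our queued sync requests,
     * so only then do we fall back to an immediate fsync ourselves.
     */
    if (wstate->numwrites > 0 && RedoRecPtrChanged(wstate->redo))
        smgrimmedsync(smgrrel, forknum);
}

With something along those lines, _bt_load() would call unbuffered_prep() where it currently caches the redo pointer and unbuffered_finish() where it currently decides whether to smgrimmedsync().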
I figured I would do that wrapping after ensuring the correctness of the logic in this patch. Then I will work on a patch which implements the unbuffered_write() API and demonstrates its utility with at least a few of the most compelling use cases in the code.

- Melanie
From 59837dfabd306bc17dcc02bd5f63c7bf5809f9d0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 15 Apr 2021 07:01:01 -0400
Subject: [PATCH v1] Index build avoids immed fsync

Avoid immediate fsync for just built indexes either by using shared
buffers or by leveraging checkpointer's SyncRequest queue. When a
checkpoint begins during the index build, if not using shared buffers,
the backend will have to do its own fsync.
---
 src/backend/access/nbtree/nbtree.c  |  39 +++---
 src/backend/access/nbtree/nbtsort.c | 189 +++++++++++++++++++++++-----
 src/backend/access/transam/xlog.c   |  14 +++
 src/include/access/xlog.h           |   1 +
 4 files changed, 190 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1360ab80c1..ed3ee8d0e3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -150,30 +150,29 @@ void
 btbuildempty(Relation index)
 {
     Page        metapage;
+    Buffer      metabuf;
 
-    /* Construct metapage. */
-    metapage = (Page) palloc(BLCKSZ);
-    _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
-
+    // TODO: test this.
     /*
-     * Write the page and log it.  It might seem that an immediate sync would
-     * be sufficient to guarantee that the file exists on disk, but recovery
-     * itself might remove it while replaying, for example, an
-     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
-     * this even when wal_level=minimal.
+     * Construct metapage.
+     * Because we don't need to lock the relation for extension (since
+     * no one knows about it yet) and we don't need to initialize the
+     * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+     * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+     * either modifying it or adding a new helper function instead of
+     * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+     * because we want to hold an exclusive lock on the buffer content
      */
-    PageSetChecksumInplace(metapage, BTREE_METAPAGE);
-    smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
-              (char *) metapage, true);
-    log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-                BTREE_METAPAGE, metapage, true);
+    metabuf = ReadBufferExtended(index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
 
-    /*
-     * An immediate sync is required even if we xlog'd the page, because the
-     * write did not go through shared_buffers and therefore a concurrent
-     * checkpoint may have moved the redo pointer past our xlog record.
-     */
-    smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+    metapage = BufferGetPage(metabuf);
+    _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+
+    START_CRIT_SECTION();
+    MarkBufferDirty(metabuf);
+    log_newpage_buffer(metabuf, true);
+    END_CRIT_SECTION();
+    _bt_relbuf(index, metabuf);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2c4d7f6e25..bde02361e1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -233,6 +233,7 @@ typedef struct BTPageState
 {
     Page        btps_page;      /* workspace for page building */
     BlockNumber btps_blkno;     /* block # to write this page at */
+    Buffer      btps_buf;       /* buffer to write this page to */
     IndexTuple  btps_lowkey;    /* page's strict lower bound pivot tuple */
     OffsetNumber btps_lastoff;  /* last item offset loaded */
     Size        btps_lastextra; /* last item's extra posting list space */
@@ -250,9 +251,11 @@ typedef struct BTWriteState
     Relation    index;
     BTScanInsert inskey;        /* generic insertion scankey */
     bool        btws_use_wal;   /* dump pages to WAL? */
-    BlockNumber btws_pages_alloced; /* # pages allocated */
-    BlockNumber btws_pages_written; /* # pages written out */
+    BlockNumber btws_pages_alloced; /* # pages allocated for index build outside SB */
+    BlockNumber btws_pages_written; /* # pages written out for index build outside SB */
     Page        btws_zeropage;  /* workspace for filling zeroes */
+    XLogRecPtr  redo;           /* cached redo pointer to determine if backend fsync is required at end of index build */
+    bool        use_shared_buffers;
 } BTWriteState;
 
 
@@ -261,10 +264,11 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
                       Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2,
+                          bool use_shared_buffers);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
                                bool *isnull, bool tupleIsAlive, void *state);
-static Page _bt_blnewpage(uint32 level);
+static Page _bt_blnewpage(uint32 level, Buffer buf);
 static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page rightmostpage);
 static void _bt_sortaddtup(Page page, Size itemsize,
@@ -275,7 +279,8 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
                                           BTPageState *state,
                                           BTDedupState dstate);
-static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state,
+                              Buffer metabuf, Page metapage);
 static void _bt_load(BTWriteState *wstate,
                      BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -323,13 +328,24 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
                 RelationGetRelationName(index));
 
     reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
 
+    /*
+     * Based on the number of tuples, either create a buffered or unbuffered
+     * write state. If the number of tuples is small, make a buffered write;
+     * if the number of tuples is larger, then we make an unbuffered write state
+     * and must ensure that we check the redo pointer to know whether or not we
+     * need to fsync ourselves.
+     */
     /*
      * Finish the build by (1) completing the sort of the spool file, (2)
      * inserting the sorted tuples into btree pages and (3) building the upper
      * levels.  Finally, it may also be necessary to end use of parallelism.
      */
-    _bt_leafbuild(buildstate.spool, buildstate.spool2);
+    if (reltuples > 1000)
+        _bt_leafbuild(buildstate.spool, buildstate.spool2, false);
+    else
+        _bt_leafbuild(buildstate.spool, buildstate.spool2, true);
+
     _bt_spooldestroy(buildstate.spool);
     if (buildstate.spool2)
         _bt_spooldestroy(buildstate.spool2);
@@ -535,7 +551,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool use_shared_buffers)
 {
     BTWriteState wstate;
 
@@ -565,9 +581,11 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
-    wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
+    wstate.btws_pages_alloced = 0;
     wstate.btws_pages_written = 0;
     wstate.btws_zeropage = NULL;    /* until needed */
+    wstate.redo = InvalidXLogRecPtr;
+    wstate.use_shared_buffers = use_shared_buffers;
 
     pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
                                  PROGRESS_BTREE_PHASE_LEAF_LOAD);
@@ -605,14 +623,18 @@ _bt_build_callback(Relation index,
 
 /*
  * allocate workspace for a new, clean btree page, not linked to any siblings.
+ * If index is not built in shared buffers, buf should be InvalidBuffer
  */
 static Page
-_bt_blnewpage(uint32 level)
+_bt_blnewpage(uint32 level, Buffer buf)
 {
     Page        page;
     BTPageOpaque opaque;
 
-    page = (Page) palloc(BLCKSZ);
+    if (buf)
+        page = BufferGetPage(buf);
+    else
+        page = (Page) palloc(BLCKSZ);
 
     /* Zero the page and set up standard page header info */
     _bt_pageinit(page, BLCKSZ);
@@ -634,8 +656,20 @@ _bt_blnewpage(uint32 level)
  * emit a completed btree page, and release the working storage.
  */
 static void
-_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
+_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
 {
+    if (wstate->use_shared_buffers)
+    {
+        Assert(buf);
+        START_CRIT_SECTION();
+        MarkBufferDirty(buf);
+        if (wstate->btws_use_wal)
+            log_newpage_buffer(buf, true);
+        END_CRIT_SECTION();
+        _bt_relbuf(wstate->index, buf);
+        return;
+    }
+
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
@@ -661,7 +695,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
             smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM,
                        wstate->btws_pages_written++,
                        (char *) wstate->btws_zeropage,
-                       true);
+                       false);
     }
 
     PageSetChecksumInplace(page, blkno);
@@ -674,14 +708,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     {
         /* extending the file... */
         smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
-                   (char *) page, true);
+                   (char *) page, false);
         wstate->btws_pages_written++;
     }
     else
     {
         /* overwriting a block we zero-filled before */
         smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
-                  (char *) page, true);
+                  (char *) page, false);
     }
 
     pfree(page);
@@ -694,13 +728,37 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 static BTPageState *
 _bt_pagestate(BTWriteState *wstate, uint32 level)
 {
+    Buffer      buf;
+    BlockNumber blkno;
+
     BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));
 
-    /* create initial page for level */
-    state->btps_page = _bt_blnewpage(level);
+    if (wstate->use_shared_buffers)
+    {
+        /*
+         * Because we don't need to lock the relation for extension (since
+         * no one knows about it yet) and we don't need to initialize the
+         * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+         * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+         * either modifying it or adding a new helper function instead of
+         * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+         * because we want to hold an exclusive lock on the buffer content
+         */
+        buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+
+        blkno = BufferGetBlockNumber(buf);
+    }
+    else
+    {
+        buf = InvalidBuffer;
+        blkno = wstate->btws_pages_alloced++;
+    }
 
+    /* create initial page for level */
+    state->btps_page = _bt_blnewpage(level, buf);
     /* and assign it a page position */
-    state->btps_blkno = wstate->btws_pages_alloced++;
+    state->btps_blkno = blkno;
+    state->btps_buf = buf;
 
     state->btps_lowkey = NULL;
     /* initialize lastoff so first item goes into P_FIRSTKEY */
@@ -835,6 +893,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 {
     Page        npage;
     BlockNumber nblkno;
+    Buffer      nbuf;
     OffsetNumber last_off;
     Size        last_truncextra;
     Size        pgspc;
@@ -849,6 +908,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
     npage = state->btps_page;
     nblkno = state->btps_blkno;
+    nbuf = state->btps_buf;
     last_off = state->btps_lastoff;
     last_truncextra = state->btps_lastextra;
     state->btps_lastextra = truncextra;
@@ -905,16 +965,37 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
          */
        Page        opage = npage;
        BlockNumber oblkno = nblkno;
+       Buffer      obuf = nbuf;
        ItemId      ii;
        ItemId      hii;
        IndexTuple  oitup;
 
-       /* Create new page of same level */
-       npage = _bt_blnewpage(state->btps_level);
+       if (wstate->use_shared_buffers)
+       {
+           /*
+            * Get a new shared buffer.
+            * Because we don't need to lock the relation for extension (since
+            * no one knows about it yet) and we don't need to initialize the
+            * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
+            * (with P_NEW and BT_WRITE) is overkill. However, it might be worth
+            * either modifying it or adding a new helper function instead of
+            * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
+            * because we want to hold an exclusive lock on the buffer content
+            */
+           nbuf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
 
-       /* and assign it a page position */
-       nblkno = wstate->btws_pages_alloced++;
+           /* assign a page position */
+           nblkno = BufferGetBlockNumber(nbuf);
+       }
+       else
+       {
+           nbuf = InvalidBuffer;
+           /* assign a page position */
+           nblkno = wstate->btws_pages_alloced++;
+       }
 
+       /* Create new page of same level */
+       npage = _bt_blnewpage(state->btps_level, nbuf);
        /*
         * We copy the last item on the page into the new page, and then
         * rearrange the old page so that the 'last item' becomes its high key
@@ -1023,10 +1104,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
        /*
         * Write out the old page. We never need to touch it again, so we can
-        * free the opage workspace too.
+        * free the opage workspace too. obuf has been released and is no longer
+        * valid.
         */
-       _bt_blwritepage(wstate, opage, oblkno);
-
+       _bt_blwritepage(wstate, opage, oblkno, obuf);
        /*
         * Reset last_off to point to new page
         */
@@ -1060,6 +1141,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
 
     state->btps_page = npage;
     state->btps_blkno = nblkno;
+    state->btps_buf = nbuf;
     state->btps_lastoff = last_off;
 }
 
@@ -1105,12 +1187,11 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
  * Finish writing out the completed btree.
  */
 static void
-_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
+_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf, Page metapage)
 {
     BTPageState *s;
     BlockNumber rootblkno = P_NONE;
     uint32      rootlevel = 0;
-    Page        metapage;
 
     /*
      * Each iteration of this loop completes one more level of the tree.
@@ -1156,20 +1237,22 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
         * back one slot.  Then we can dump out the page.
         */
        _bt_slideleft(s->btps_page);
-       _bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
+       _bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
+       s->btps_buf = InvalidBuffer;
        s->btps_page = NULL;    /* writepage freed the workspace */
     }
 
     /*
-     * As the last step in the process, construct the metapage and make it
+     * As the last step in the process, initialize the metapage and make it
      * point to the new root (unless we had no data at all, in which case it's
     * set to point to "P_NONE").  This changes the index to the "valid" state
     * by filling in a valid magic number in the metapage.
+     * After this, metapage will have been freed or invalid and metabuf, if ever
+     * valid, will have been released.
     */
-    metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel,
                      wstate->inskey->allequalimage);
-    _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
+    _bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf);
 }
 
 /*
@@ -1190,10 +1273,47 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     SortSupport sortKeys;
     int64       tuples_done = 0;
     bool        deduplicate;
+    Buffer      metabuf;
+    Page        metapage;
 
     deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
         BTGetDeduplicateItems(wstate->index);
 
+    /*
+     * Extend the index relation upfront to reserve the metapage
+     */
+    if (wstate->use_shared_buffers)
+    {
+        /*
+         * We should not need to LockRelationForExtension() as no one else knows
+         * about this index yet?
+         * Extend the index relation by one block for the metapage. _bt_getbuf()
+         * is not used here as it does _bt_pageinit() which is done later by
+         * _bt_initmetapage(). We will fill in the metapage and write it out at
+         * the end of index build when we have all of the information required
+         * for the metapage. However, we initially extend the relation for it to
+         * occupy block 0 because it is much easier when using shared buffers to
+         * extend the relation with a block number that is always increasing by
+         * 1. Also, by passing RBM_ZERO_AND_LOCK, we have LW_EXCLUSIVE on the
+         * buffer content and thus don't need _bt_lockbuf().
+         */
+        metabuf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+        metapage = BufferGetPage(metabuf);
+    }
+    else
+    {
+        wstate->redo = GetRedoRecPtr();
+        metabuf = InvalidBuffer;
+        metapage = (Page) palloc(BLCKSZ);
+        RelationOpenSmgr(wstate->index);
+
+        /* extending the file... */
+        smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, BTREE_METAPAGE,
+                   (char *) metapage, false);
+        wstate->btws_pages_written++;
+        wstate->btws_pages_alloced++;
+    }
+
     if (merge)
     {
         /*
@@ -1415,7 +1535,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     }
 
     /* Close down final pages and write the metapage */
-    _bt_uppershutdown(wstate, state);
+    _bt_uppershutdown(wstate, state, metabuf, metapage);
 
     /*
      * When we WAL-logged index pages, we must nonetheless fsync index files.
@@ -1428,8 +1548,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
      */
     if (wstate->btws_use_wal)
     {
-        RelationOpenSmgr(wstate->index);
-        smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+        if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
+        {
+            RelationOpenSmgr(wstate->index);
+            smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+        }
     }
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e2..d3b6c60278 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8546,6 +8546,20 @@ GetRedoRecPtr(void)
     return RedoRecPtr;
 }
 
+bool
+RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+    XLogRecPtr  ptr;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    ptr = XLogCtl->RedoRecPtr;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    if (RedoRecPtr < ptr)
+        RedoRecPtr = ptr;
+    return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Return information needed to decide whether a modified block needs a
  * full-page image to be included in the WAL record.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f542af0a26..44e4b01559 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -346,6 +346,7 @@ extern XLogRecPtr XLogRestorePoint(const char *rpName);
 extern void UpdateFullPageWrites(void);
 extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
 extern XLogRecPtr GetRedoRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
-- 
2.27.0