On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <[email protected]> wrote:
>
> On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > Thanks for doing that.  On my side, I am going to look at the gin and
> > hash vacuum paths first with more testing as these don't use a custom
> > callback.  I don't think that I am going to need a lot of convincing,
> > but I'd rather produce some numbers myself because doing something.
> > I'll tweak a mounting point with the delay trick, as well.
>
> While debug_io_direct has been helping a bit, the trick for the delay
> to throttle the IO activity has helped much more with my runtime
> numbers.  I have mounted a separate partition with a delay of 5ms,
> disabled checksums (this part did not make a real difference), and
> evicted shared buffers for relation and indexes before the VACUUM.
>
> Then I got better numbers.  Here is an extract:
> - worker=3:
>  gin_vacuum (100k tuples)    base=  1448.2ms patch=   572.5ms  2.53x ( 60.5%) (reads=175→104, io_time=1382.70→506.64ms)
>  gin_vacuum (300k tuples)    base=  3728.0ms patch=  1332.0ms  2.80x ( 64.3%) (reads=486→293, io_time=3669.89→1266.27ms)
>  bloom_vacuum (100k tuples)  base= 21826.8ms patch= 17220.3ms  1.27x ( 21.1%) (reads=485→117, io_time=4773.33→270.56ms)
>  bloom_vacuum (300k tuples)  base= 67054.0ms patch= 53164.7ms  1.26x ( 20.7%) (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> - io_uring:
>  gin_vacuum (100k tuples)    base=  1240.3ms patch=   360.5ms  3.44x ( 70.9%) (reads=175→104, io_time=1175.35→299.75ms)
>  gin_vacuum (300k tuples)    base=  2829.9ms patch=   642.0ms  4.41x ( 77.3%) (reads=465.5→293, io_time=2768.46→579.04ms)
>  bloom_vacuum (100k tuples)  base= 22121.7ms patch= 17532.3ms  1.26x ( 20.7%) (reads=485→117, io_time=4850.46→285.28ms)
>  bloom_vacuum (300k tuples)  base= 67058.0ms patch= 53118.0ms  1.26x ( 20.8%) (reads=1431.5→327.5, io_time=13870.9→305.44ms)
>
> The higher the number of tuples, the better the performance for each
> individual operation, but the tests take a much longer time (tens of
> seconds vs tens of minutes).  For GIN, the numbers can be quite good
> once these reads are pushed.  For bloom, the runtime is improved, and
> the IO numbers are much better.
>
> At the end, I have applied these two parts.  Remains now the hash
> vacuum and the two parts for pgstattuple.
> --
> Michael
Thanks for running the benchmarks and pushing!
Here are the results of my tests with debug_io_direct and the delay setup:
-- io_uring, medium size
bloom_vacuum_medium base= 8355.2ms patch= 715.0ms 11.68x
( 91.4%) (reads=4732→1056, io_time=7699.47→86.52ms)
pgstattuple_medium base= 4012.8ms patch= 213.7ms 18.78x
( 94.7%) (reads=2006→2006, io_time=4001.66→200.24ms)
pgstatindex_medium base= 5490.6ms patch= 37.9ms 144.88x
( 99.3%) (reads=2745→173, io_time=5481.54→7.82ms)
hash_vacuum_medium base= 34483.4ms patch= 2703.5ms 12.75x
( 92.2%) (reads=19166→3901, io_time=31948.33→308.05ms)
wal_logging_medium base= 7778.6ms patch= 7814.5ms 1.00x
( -0.5%) (reads=2857→2845, io_time=11.84→11.45ms)
-- worker, medium size
bloom_vacuum_medium base= 8376.2ms patch= 747.7ms 11.20x
( 91.1%) (reads=4732→1056, io_time=7688.91→65.49ms)
pgstattuple_medium base= 4012.7ms patch= 339.0ms 11.84x
( 91.6%) (reads=2006→2006, io_time=4002.23→49.99ms)
pgstatindex_medium base= 5490.3ms patch= 38.3ms 143.23x
( 99.3%) (reads=2745→173, io_time=5480.60→16.24ms)
hash_vacuum_medium base= 34638.4ms patch= 2940.2ms 11.78x
( 91.5%) (reads=19166→3901, io_time=31881.61→242.01ms)
wal_logging_medium base= 7440.1ms patch= 7434.0ms 1.00x
( 0.1%) (reads=2861→2825, io_time=10.62→10.71ms)
-- Setting read delay only
sudo dmsetup reload "$DM_DELAY_DEV" --table "0 $size delay $dev 0 $ms $dev 0 0"
Setting dm_delay on delayed to 2ms read / 0ms write
After setting the write delay to 0ms, I observe more pronounced
speedups overall. Since a vacuum operation is write-intensive, delaying
writes can dominate the runtime and mask the read-path improvements
we're measuring. Dropping the write delay also shortens the overall
runtime of the test.
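For reference, the dm-delay table line used in the reload command above has the form `<start> <size> delay <dev> <offset> <read_ms> [<dev> <offset> <write_ms>]`: the three optional trailing fields give writes their own delay, while the short form applies one delay to all I/O. A tiny sketch of the field layout (the `make_delay_table` helper is hypothetical, only there to make the format testable):

```shell
# Build a dm-delay table line with separate read and write delays.
# Fields: <start> <size> delay <dev> <offset> <read_ms> <dev> <offset> <write_ms>
make_delay_table() {
    local size=$1 dev=$2 read_ms=$3 write_ms=$4
    echo "0 $size delay $dev 0 $read_ms $dev 0 $write_ms"
}

# e.g. 2ms reads, 0ms writes on a 1000-sector device:
make_delay_table 1000 /dev/loop0 2 0
```

Note that, as far as I can tell, `dmsetup reload` only stages the new table in the inactive slot; a `dmsetup suspend`/`dmsetup resume` cycle on the device is what makes it live.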
-- wal_logging
The wal_logging patch does not seem to benefit from streamification in
this configuration either.
-- Delay setup
For anyone wanting to reproduce the results with a simulated-latency
device, here is the setup I used.
1. Create a 50GB file-backed block device (enough for PG data + indexes)
sudo dd if=/dev/zero of=/srv/delay_disk.img bs=1M count=50000 status=progress
sudo losetup /dev/loop0 /srv/delay_disk.img
2. Create the dm_delay device with 2ms delay
sudo dmsetup create delayed --table "0 $(sudo blockdev --getsz /dev/loop0) delay /dev/loop0 0 2"
3. Format and mount it
sudo mkfs.ext4 /dev/mapper/delayed
sudo mkdir -p /srv/pg_delayed
sudo mount /dev/mapper/delayed /srv/pg_delayed
sudo chown $(whoami) /srv/pg_delayed
4. Run benchmark with WORKROOT pointing to the delayed device
WORKROOT=/srv/pg_delayed SIZES=medium REPS=3 \
./run_streaming_benchmark.sh --baseline --io-method io_uring \
--test gin_vacuum --direct-io --io-delay 2 \
<the targeted patch>
--
Best,
Xuneng
From 2adfcd6c16f94e7dadb38ffc6cfed3457b363bf5 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Sun, 28 Dec 2025 18:29:28 +0800
Subject: [PATCH v6 3/5] Streamify hash index VACUUM primary bucket page reads

Refactor hashbulkdelete() to use the Read Stream for primary bucket
pages.  This enables prefetching of upcoming buckets while the current
one is being processed, improving I/O efficiency during hash index
vacuum operations.
---
 src/backend/access/hash/hash.c   | 80 ++++++++++++++++++++++++++++++--
 src/tools/pgindent/typedefs.list |  1 +
 2 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e88ddb32a05..6df5e7ccbd1 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -30,6 +30,7 @@
 #include "nodes/execnodes.h"
 #include "optimizer/plancat.h"
 #include "pgstat.h"
+#include "storage/read_stream.h"
 #include "utils/fmgrprotos.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
@@ -42,12 +43,23 @@ typedef struct
 	Relation	heapRel;		/* heap relation descriptor */
 } HashBuildState;
 
+/* Working state for streaming reads in hashbulkdelete */
+typedef struct
+{
+	HashMetaPage metap;			/* cached metapage for BUCKET_TO_BLKNO */
+	Bucket		next_bucket;	/* next bucket to prefetch */
+	Bucket		max_bucket;		/* stop when next_bucket > max_bucket */
+} HashBulkDeleteStreamPrivate;
+
 static void hashbuildCallback(Relation index,
 							  ItemPointer tid,
 							  Datum *values,
 							  bool *isnull,
 							  bool tupleIsAlive,
 							  void *state);
+static BlockNumber hash_bulkdelete_read_stream_cb(ReadStream *stream,
+												  void *callback_private_data,
+												  void *per_buffer_data);
 
 /*
@@ -451,6 +463,27 @@ hashendscan(IndexScanDesc scan)
 	scan->opaque = NULL;
 }
 
+/*
+ * Read stream callback for hashbulkdelete.
+ *
+ * Returns the block number of the primary page for the next bucket to
+ * vacuum, using the BUCKET_TO_BLKNO mapping from the cached metapage.
+ */
+static BlockNumber
+hash_bulkdelete_read_stream_cb(ReadStream *stream,
+							   void *callback_private_data,
+							   void *per_buffer_data)
+{
+	HashBulkDeleteStreamPrivate *p = callback_private_data;
+	Bucket		bucket;
+
+	if (p->next_bucket > p->max_bucket)
+		return InvalidBlockNumber;
+
+	bucket = p->next_bucket++;
+	return BUCKET_TO_BLKNO(p->metap, bucket);
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -475,6 +508,8 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Buffer		metabuf = InvalidBuffer;
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
+	HashBulkDeleteStreamPrivate stream_private;
+	ReadStream *stream = NULL;
 
 	tuples_removed = 0;
 	num_index_tuples = 0;
@@ -495,7 +530,25 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	cur_bucket = 0;
 	cur_maxbucket = orig_maxbucket;
 
-loop_top:
+	/* Set up streaming read for primary bucket pages */
+	stream_private.metap = cachedmetap;
+	stream_private.next_bucket = cur_bucket;
+	stream_private.max_bucket = cur_maxbucket;
+
+	/*
+	 * It is safe to use batchmode as hash_bulkdelete_read_stream_cb takes no
+	 * locks.
+	 */
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE |
+										READ_STREAM_USE_BATCHING,
+										info->strategy,
+										rel,
+										MAIN_FORKNUM,
+										hash_bulkdelete_read_stream_cb,
+										&stream_private,
+										0);
+
+bucket_loop:
 	while (cur_bucket <= cur_maxbucket)
 	{
 		BlockNumber bucket_blkno;
@@ -515,7 +568,8 @@ loop_top:
 		 * We need to acquire a cleanup lock on the primary bucket page to out
 		 * wait concurrent scans before deleting the dead tuples.
 		 */
-		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+		buf = read_stream_next_buffer(stream, NULL);
+		Assert(BufferIsValid(buf));
 		LockBufferForCleanup(buf);
 		_hash_checkpage(rel, buf, LH_BUCKET_PAGE);
 
@@ -546,6 +600,16 @@ loop_top:
 		{
 			cachedmetap = _hash_getcachedmetap(rel, &metabuf, true);
 			Assert(cachedmetap != NULL);
+
+			/*
+			 * Reset stream with updated metadata for remaining buckets.
+			 * The BUCKET_TO_BLKNO mapping depends on hashm_spares[],
+			 * which may have changed.
+			 */
+			stream_private.metap = cachedmetap;
+			stream_private.next_bucket = cur_bucket + 1;
+			stream_private.max_bucket = cur_maxbucket;
+			read_stream_reset(stream);
 		}
 	}
 
@@ -578,9 +642,19 @@ loop_top:
 		cachedmetap = _hash_getcachedmetap(rel, &metabuf, true);
 		Assert(cachedmetap != NULL);
 		cur_maxbucket = cachedmetap->hashm_maxbucket;
-		goto loop_top;
+
+		/* Reset stream to process additional buckets from split */
+		stream_private.metap = cachedmetap;
+		stream_private.next_bucket = cur_bucket;
+		stream_private.max_bucket = cur_maxbucket;
+		read_stream_reset(stream);
+		goto bucket_loop;
 	}
 
+	/* Stream should be exhausted since we processed all buckets */
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
+
 	/* Okay, we're really done.  Update tuple count in metapage. */
 	START_CRIT_SECTION();
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a67246138eb..0d60a17bc2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1185,6 +1185,7 @@ HashAggBatch
 HashAggSpill
 HashAllocFunc
 HashBuildState
+HashBulkDeleteStreamPrivate
 HashCompareFunc
 HashCopyFunc
 HashIndexStat
--
2.51.0
From 4350511d40f5efed2be26c518cbb15c4c8435eb4 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Sat, 27 Dec 2025 00:29:02 +0800
Subject: [PATCH v6 2/5] Streamify heap bloat estimation scan.

Introduce a read-stream callback to skip all-visible pages via VM/FSM
lookup and stream-read the rest, reducing page reads and improving
pgstattuple_approx execution time on large relations.
---
 contrib/pgstattuple/pgstatapprox.c | 126 ++++++++++++++++++++++-------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 96 insertions(+), 31 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 3fad24cf248..68ae7720b31 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -23,6 +23,7 @@
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 
 PG_FUNCTION_INFO_V1(pgstattuple_approx);
 PG_FUNCTION_INFO_V1(pgstattuple_approx_v1_5);
@@ -45,6 +46,61 @@ typedef struct output_type
 
 #define NUM_OUTPUT_COLUMNS	10
 
+/*
+ * Struct for statapprox_heap read stream callback.
+ */
+typedef struct StatApproxReadStreamPrivate
+{
+	Relation	rel;
+	output_type *stat;
+	BlockNumber current_blocknum;
+	BlockNumber nblocks;
+	BlockNumber scanned;		/* count of pages actually read */
+	Buffer		vmbuffer;		/* for VM lookups */
+} StatApproxReadStreamPrivate;
+
+/*
+ * Read stream callback for statapprox_heap.
+ *
+ * This callback checks the visibility map for each block.  If the block is
+ * all-visible, we can get the free space from the FSM without reading the
+ * actual page, and skip to the next block.  Only blocks that are not
+ * all-visible are returned for actual reading.
+ */
+static BlockNumber
+statapprox_heap_read_stream_next(ReadStream *stream,
+								 void *callback_private_data,
+								 void *per_buffer_data)
+{
+	StatApproxReadStreamPrivate *p = callback_private_data;
+
+	while (p->current_blocknum < p->nblocks)
+	{
+		BlockNumber blkno = p->current_blocknum++;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the page has only visible tuples, then we can find out the free
+		 * space from the FSM and move on without reading the page.
+		 */
+		if (VM_ALL_VISIBLE(p->rel, blkno, &p->vmbuffer))
+		{
+			freespace = GetRecordedFreeSpace(p->rel, blkno);
+			p->stat->tuple_len += BLCKSZ - freespace;
+			p->stat->free_space += freespace;
+			continue;
+		}
+
+		/* This block needs to be read */
+		p->scanned++;
+		return blkno;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * This function takes an already open relation and scans its pages,
  * skipping those that have the corresponding visibility map bit set.
@@ -58,53 +114,58 @@ typedef struct output_type
 static void
 statapprox_heap(Relation rel, output_type *stat)
 {
-	BlockNumber scanned,
-				nblocks,
-				blkno;
-	Buffer		vmbuffer = InvalidBuffer;
+	BlockNumber nblocks;
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
+	StatApproxReadStreamPrivate p;
+	ReadStream *stream;
 
 	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);
-	scanned = 0;
 
-	for (blkno = 0; blkno < nblocks; blkno++)
+	/* Initialize read stream private data */
+	p.rel = rel;
+	p.stat = stat;
+	p.current_blocknum = 0;
+	p.nblocks = nblocks;
+	p.scanned = 0;
+	p.vmbuffer = InvalidBuffer;
+
+	/*
+	 * Create the read stream.  We don't use READ_STREAM_USE_BATCHING because
+	 * the callback accesses the visibility map which may need to read VM
+	 * pages.  While this shouldn't cause deadlocks, we err on the side of
+	 * caution.
+	 */
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										statapprox_heap_read_stream_next,
+										&p,
+										0);
+
+	for (;;)
 	{
 		Buffer		buf;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
-		Size		freespace;
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the page has only visible tuples, then we can find out the free
-		 * space from the FSM and move on.
-		 */
-		if (VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-		{
-			freespace = GetRecordedFreeSpace(rel, blkno);
-			stat->tuple_len += BLCKSZ - freespace;
-			stat->free_space += freespace;
-			continue;
-		}
+		BlockNumber blkno;
 
-		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, bstrategy);
+		buf = read_stream_next_buffer(stream, NULL);
+		if (buf == InvalidBuffer)
+			break;
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buf);
+		blkno = BufferGetBlockNumber(buf);
 
 		stat->free_space += PageGetExactFreeSpace(page);
 
-		/* We may count the page as scanned even if it's new/empty */
-		scanned++;
-
 		if (PageIsNew(page) || PageIsEmpty(page))
 		{
 			UnlockReleaseBuffer(buf);
@@ -169,6 +230,9 @@ statapprox_heap(Relation rel, output_type *stat)
 		UnlockReleaseBuffer(buf);
 	}
 
+	Assert(p.current_blocknum == nblocks);
+	read_stream_end(stream);
+
 	stat->table_len = (uint64) nblocks * BLCKSZ;
 
 	/*
@@ -179,7 +243,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	 * tuples in all-visible pages, so no correction is needed for that, and
 	 * we already accounted for the space in those pages, too.
 	 */
-	stat->tuple_count = vac_estimate_reltuples(rel, nblocks, scanned,
+	stat->tuple_count = vac_estimate_reltuples(rel, nblocks, p.scanned,
 											   stat->tuple_count);
 
 	/* It's not clear if we could get -1 here, but be safe. */
@@ -190,16 +254,16 @@ statapprox_heap(Relation rel, output_type *stat)
 	 */
 	if (nblocks != 0)
 	{
-		stat->scanned_percent = 100.0 * scanned / nblocks;
+		stat->scanned_percent = 100.0 * p.scanned / nblocks;
 		stat->tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
 		stat->dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
 		stat->free_percent = 100.0 * stat->free_space / stat->table_len;
 	}
 
-	if (BufferIsValid(vmbuffer))
+	if (BufferIsValid(p.vmbuffer))
 	{
-		ReleaseBuffer(vmbuffer);
-		vmbuffer = InvalidBuffer;
+		ReleaseBuffer(p.vmbuffer);
+		p.vmbuffer = InvalidBuffer;
 	}
 }
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 141b9d6e077..a67246138eb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2917,6 +2917,7 @@ StartReplicationCmd
 StartupStatusEnum
 StatEntry
 StatExtEntry
+StatApproxReadStreamPrivate
 StateFileChunk
 StatisticExtInfo
 StatsBuildData
--
2.51.0
From 085abea7e4c6998acdaf0ef96aa759ed30fd1d25 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Thu, 12 Mar 2026 10:48:38 +0800
Subject: [PATCH v6 5/5] Use streaming read API in pgstatindex functions

Replace synchronous ReadBufferExtended() loops with the streaming read
API in pgstatindex_impl() and pgstathashindex().

Author: Xuneng Zhou <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: wenhui qiu <[email protected]>
Reviewed-by: Shinya Kato <[email protected]>
---
 contrib/pgstattuple/pgstatindex.c | 65 +++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index ef723af1f19..2b3c351ecff 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -37,6 +37,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
 
@@ -217,6 +218,8 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 	BlockNumber blkno;
 	BTIndexStat indexStat;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream;
 
 	if (!IS_INDEX(rel) || !IS_BTREE(rel))
 		ereport(ERROR,
@@ -273,11 +276,29 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 	indexStat.fragments = 0;
 
 	/*
-	 * Scan all blocks except the metapage
+	 * Scan all blocks except the metapage (0th page) using streaming reads
 	 */
 	nblocks = RelationGetNumberOfBlocks(rel);
 
-	for (blkno = 1; blkno < nblocks; blkno++)
+	BlockNumber startblk = BTREE_METAPAGE + 1;
+
+	p.current_blocknum = startblk;
+	p.last_exclusive = nblocks;
+
+	/*
+	 * It is safe to use batchmode as block_range_read_stream_cb takes no
+	 * locks.
+	 */
+	stream = read_stream_begin_relation(READ_STREAM_FULL |
+										READ_STREAM_USE_BATCHING,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										block_range_read_stream_cb,
+										&p,
+										0);
+
+	for (blkno = startblk; blkno < nblocks; blkno++)
 	{
 		Buffer		buffer;
 		Page		page;
@@ -285,8 +306,7 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 
 		CHECK_FOR_INTERRUPTS();
 
-		/* Read and lock buffer */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+		buffer = read_stream_next_buffer(stream, NULL);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 		page = BufferGetPage(buffer);
 
@@ -322,11 +342,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 		else
 			indexStat.internal_pages++;
 
-		/* Unlock and release buffer */
-		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
-		ReleaseBuffer(buffer);
+		UnlockReleaseBuffer(buffer);
 	}
 
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
+
 	relation_close(rel, AccessShareLock);
 
 	/*----------------------------
@@ -600,6 +621,8 @@ pgstathashindex(PG_FUNCTION_ARGS)
 	HashMetaPage metap;
 	float8		free_percent;
 	uint64		total_space;
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream;
 
 	/*
 	 * This uses relation_open() and not index_open().  The latter allows
@@ -644,16 +667,33 @@ pgstathashindex(PG_FUNCTION_ARGS)
 	/* prepare access strategy for this index */
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	/* Start from blkno 1 as 0th block is metapage */
-	for (blkno = 1; blkno < nblocks; blkno++)
+	/* Scan all blocks except the metapage (0th page) using streaming reads */
+	BlockNumber startblk = HASH_METAPAGE + 1;
+
+	p.current_blocknum = startblk;
+	p.last_exclusive = nblocks;
+
+	/*
+	 * It is safe to use batchmode as block_range_read_stream_cb takes no
+	 * locks.
+	 */
+	stream = read_stream_begin_relation(READ_STREAM_FULL |
+										READ_STREAM_USE_BATCHING,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										block_range_read_stream_cb,
+										&p,
+										0);
+
+	for (blkno = startblk; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
 
 		CHECK_FOR_INTERRUPTS();
 
-		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-								 bstrategy);
+		buf = read_stream_next_buffer(stream, NULL);
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		page = BufferGetPage(buf);
 
@@ -698,6 +738,9 @@ pgstathashindex(PG_FUNCTION_ARGS)
 		UnlockReleaseBuffer(buf);
 	}
 
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
+
 	/* Done accessing the index */
 	relation_close(rel, AccessShareLock);
 
--
2.51.0
From 14f1a6bed27acf0f517140b9b4f0afa4db64777b Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Thu, 12 Mar 2026 10:41:41 +0800
Subject: [PATCH v6 4/5] Streamify log_newpage_range() WAL logging path

Refactor log_newpage_range() to use the Read Stream API.  This allows
prefetching of upcoming relation blocks during bulk WAL logging
operations, overlapping I/O with CPU-intensive XLogInsert and
WAL-writing work.
---
 src/backend/access/transam/xloginsert.c | 26 +++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index ac3c1a78396..71ef1ea2052 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -38,6 +38,7 @@
 #include "pg_trace.h"
 #include "replication/origin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -1296,6 +1297,8 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 {
 	int			flags;
 	BlockNumber blkno;
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream;
 
 	flags = REGBUF_FORCE_IMAGE;
 	if (page_std)
@@ -1308,6 +1311,23 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 	 */
 	XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0);
 
+	/* Set up a streaming read for the range of blocks */
+	p.current_blocknum = startblk;
+	p.last_exclusive = endblk;
+
+	/*
+	 * It is safe to use batchmode as block_range_read_stream_cb takes no
+	 * locks.
+	 */
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE |
+										READ_STREAM_USE_BATCHING,
+										NULL,
+										rel,
+										forknum,
+										block_range_read_stream_cb,
+										&p,
+										0);
+
 	blkno = startblk;
 	while (blkno < endblk)
 	{
@@ -1322,8 +1342,7 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 		nbufs = 0;
 		while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
 		{
-			Buffer		buf = ReadBufferExtended(rel, forknum, blkno,
-												 RBM_NORMAL, NULL);
+			Buffer		buf = read_stream_next_buffer(stream, NULL);
 
 			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1363,6 +1382,9 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 		for (i = 0; i < nbufs; i++)
 			UnlockReleaseBuffer(bufpack[i]);
 	}
+
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
 }
 
 /*
--
2.51.0
Attachment: run_streaming_benchmark.sh (Bourne shell script)
