On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
<melanieplage...@gmail.com> wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <and...@anarazel.de> wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO, this is done by functions like
> >    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> >    The (read|write|fsync|flush_range) indicates the operation, whereas
> >    (smgr|sb|raw|wal) determines how IO completions, errors, ... are
> >    handled.
> >
> >    (see below for more details about this design choice - it might or
> >    might not be right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> >    (pgaio_io_on_completion_local())
> >
> > 4) 2) alone does *not* cause the IO to be submitted to the kernel, but
> >    to be put on a per-backend list of pending IOs. The pending IOs can
> >    be explicitly flushed with pgaio_submit_pending(), but will also be
> >    submitted if the pending list gets to be too large, or if the
> >    current backend waits for the IO.
> >
> >    There are two main reasons not to submit the IO immediately:
> >    - If adjacent, we can merge several IOs into one "kernel level" IO
> >      during submission. Larger IOs are considerably more efficient.
> >    - Several AIO APIs allow submitting a batch of IOs in one system
> >      call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the
> >    current backend. When other backends may need to wait for an IO to
> >    finish, pgaio_io_ref() can put a reference to that AIO in shared
> >    memory (e.g. a BufferDesc), which can be waited for using
> >    pgaio_io_wait_ref().
> >
> > 6) Process the results of the request. If a callback was registered in
> >    3), this isn't always necessary. The results of AIO can be accessed
> >    using pgaio_io_result() which returns an integer where negative
> >    numbers are -errno, and positive numbers are the [partial] success
> >    conditions (e.g. potentially indicating a short read).
> >
> > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
> >    another operation (pgaio_io_recycle())
> >
> >
> > Most places that want to use AIO shouldn't themselves need to care
> > about managing the number of writes in flight, or the readahead
> > distance. To help with that there are two helper utilities, a
> > "streaming read" and a "streaming write".
> >
> > The "streaming read" helper uses a callback to determine which blocks
> > to prefetch - that allows doing readahead in a sequential fashion but,
> > importantly, also allows asynchronously "reading ahead" non-sequential
> > blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the
> > visibility map to figure out which block needs to be read next.
> > Similarly lazy_vacuum_heap() uses the tids in LVDeadTuples to figure
> > out which blocks are going to be needed. Here's the latter as an
> > example:
> > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
> ...
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
>
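For anyone who wants the quoted API overview in code form, the seven steps
reduce to roughly the call sequence below. This is only a sketch against the
AIO branch -- the argument lists are placeholders, since the overview above
doesn't spell out the real signatures:

	/* 1) acquire an "unused" AIO */
	PgAioInProgress *aio = pgaio_io_get();

	/* 2) stage an IO; the _smgr suffix picks how completions/errors are
	 * handled. This only adds the IO to the backend-local pending list. */
	pgaio_io_start_read_smgr(aio /* ... target relation, block, buffer ... */);

	/* 3) optionally register a backend-local completion callback */
	pgaio_io_on_completion_local(aio /* ... callback ... */);

	/* 4) submit pending IOs explicitly (merging adjacent IOs and batching
	 * the system call); this also happens when the pending list grows too
	 * large or when this backend waits for the IO */
	pgaio_submit_pending(true);

	/* 5) wait for an IO we own; another backend would instead use
	 * pgaio_io_ref() plus pgaio_io_wait_ref() on a reference stored in
	 * shared memory (e.g. a BufferDesc) */
	pgaio_io_wait(aio);

	/* 6) negative results are -errno; positive results are (possibly
	 * partial) success, e.g. a short read */
	if (pgaio_io_result(aio) < 0)
		elog(ERROR, "IO failed: %s", strerror(-pgaio_io_result(aio)));

	/* 7) release, or pgaio_io_recycle() to reuse for another IO */
	pgaio_io_release(aio);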
I have done some testing on how effective the prefetching is now.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.

I've also included a second patch which adds IO wait time information
to EXPLAIN output when used like:
EXPLAIN (buffers, analyze) SELECT ...
The same commit also introduces a temporary dev GUC
io_bitmap_prefetch_depth which I am using to experiment with the
prefetch window size.

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working. The short version of my results
is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below:

On Andres' AIO branch [1] with my bitmap heapscan prefetching patch set
applied, built with the following flags:

-O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:

io_data_direct=1
io_data_force_async=off
io_method=io_uring
log_min_duration_statement=0
log_duration=on

set track_io_timing to on;
set max_parallel_workers_per_gather to 0;
set enable_seqscan to off;
set enable_indexscan to off;
set enable_bitmapscan to on;
set effective_io_concurrency to 128;
set io_bitmap_prefetch_depth to 128;

Using this DDL:

drop table if exists bar;
create table bar(a int, b text, c text, d int);
create index bar_idx on bar(a);
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,1000)i;
insert into bar select i%3, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,1000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,200)i;
insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%2000, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,10000000)i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,2000)i;
insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz', i from generate_series(1,2000)i;
analyze;

And this query:

select * from bar where a > 100 offset 10000000000000;

With the prefetch window set to 1, the query execution time is
5496.129 ms and the IO wait time is 1550.915 ms:

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1315.845 wait=1550.915
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1315.845 wait=1550.915
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1315.845
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.399
 Planning Time: 4.378 ms
 Execution Time: 5473.404 ms
(18 rows)

With the prefetch window set to 128, the query execution time is
3222 ms and the IO wait time is 0.178 ms:

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1339.795 wait=0.178
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1339.795 wait=0.178
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1339.795
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.488
 Planning Time: 4.279 ms
 Execution Time: 3434.522 ms
(18 rows)

- Melanie

[1] https://github.com/anarazel/postgres/tree/aio
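For readers skimming the attached patches: condensed from patch 1/2 below
(error handling and bookkeeping elided, so not compilable as shown), the
executor-side prefetch machinery boils down to wiring the scan's single
bitmap iterator into the PgStreamingRead helper:

	/* setup (bitmapheap_pgsr_alloc): one helper per scan, two callbacks */
	hscan->pgsr = pg_streaming_read_alloc(iodepth, (uintptr_t) scanstate,
										  bitmapheapscan_pgsr_next_single,
										  bitmapheapscan_pgsr_release);

	/* next-block callback: the helper calls this to stage reads, keeping
	 * up to iodepth of them in flight (from bitmapheapscan_pgsr_next_single) */
	tbm_iterate(bhs_state->tbmiterator, tbmres);	/* the *main* iterator */
	if (!BlockNumberIsValid(tbmres->blockno))
		return PGSR_NEXT_END;			/* bitmap exhausted */
	if (skip_fetch)
		return PGSR_NEXT_NO_IO;			/* all-visible, no heap read needed */
	tbmres->buffer = ReadBufferAsync(rel, MAIN_FORKNUM, tbmres->blockno,
									 RBM_NORMAL, strategy, &already_valid, &aio);
	*read_private = (uintptr_t) tbmres;
	return already_valid ? PGSR_NEXT_NO_IO : PGSR_NEXT_IO;

	/* consumer side (heapam_scan_bitmap_next_block): completed reads come
	 * back one at a time, in submission order */
	*tbmres = (TBMIterateResult *) pg_streaming_read_get_next(hscan->pgsr);

Because the helper owns the lookahead distance, all of the
prefetch_target/prefetch_pages window management in nodeBitmapHeapscan.c
can simply be deleted, as the diff below shows.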
From 752fd6cc636eea08c06b25d8898e091835442387 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 5 Aug 2021 15:47:50 -0400
Subject: [PATCH v2 2/2] Add IO wait time stat and add guc for BHS prefetch

- Add an IO wait time measurement which can be seen in EXPLAIN output
  (with explain buffers option and track_io_timing on)
- TODO: add the wait time to database statistics
- Also, add a GUC to control the BHS pre-fetch max window size
---
 src/backend/access/heap/heapam_handler.c |  2 +-
 src/backend/commands/explain.c           |  9 ++++++++-
 src/backend/executor/instrument.c        |  4 ++++
 src/backend/postmaster/pgstat.c          |  6 ++++++
 src/backend/storage/aio/aio.c            | 16 ++++++++++++++++
 src/backend/storage/aio/aio_util.c       |  2 --
 src/backend/storage/buffer/bufmgr.c      |  2 ++
 src/backend/utils/adt/pgstatfuncs.c      |  2 ++
 src/backend/utils/misc/guc.c             | 10 ++++++++++
 src/include/executor/instrument.h        |  1 +
 src/include/pgstat.h                     |  6 ++++++
 src/include/storage/bufmgr.h             |  1 +
 12 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8bd8050dc..7409dd2fb3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2221,7 +2221,7 @@ void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate)
 	HeapScanDesc hscan = (HeapScanDesc ) scanstate->ss.ss_currentScanDesc;
 	if (!hscan->rs_inited)
 	{
-		int iodepth = Max(Min(128, NBuffers / 128), 1);
+		int iodepth = io_bitmap_prefetch_depth;
 		hscan->pgsr = pg_streaming_read_alloc(iodepth, (uintptr_t) scanstate,
 				bitmapheapscan_pgsr_next_single,
 				bitmapheapscan_pgsr_release);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c66d39a5c7..d1e89f1e1b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3339,7 +3339,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 	bool		has_temp = (usage->temp_blks_read > 0 ||
 							usage->temp_blks_written > 0);
 	bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
-							  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+							  !INSTR_TIME_IS_ZERO(usage->blk_write_time) ||
+							  !INSTR_TIME_IS_ZERO(usage->io_wait_time));
 	bool		show_planning = (planning && (has_shared ||
 											  has_local || has_temp || has_timing));
 
@@ -3416,6 +3417,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
 				appendStringInfo(es->str, " write=%0.3f",
 								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+			if (!INSTR_TIME_IS_ZERO(usage->io_wait_time))
+				appendStringInfo(es->str, " wait=%0.3f",
+								 INSTR_TIME_GET_MILLISEC(usage->io_wait_time));
 			appendStringInfoChar(es->str, '\n');
 		}
 
@@ -3452,6 +3456,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			ExplainPropertyFloat("I/O Write Time", "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
 								 3, es);
+			ExplainPropertyFloat("I/O Wait Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->io_wait_time),
+								 5, es);
 		}
 	}
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 96af2a2673..85e7321d63 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -218,6 +218,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+	INSTR_TIME_ADD(dst->io_wait_time, add->io_wait_time);
+
 }
 
 /* dst += add - sub */
@@ -240,6 +242,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
+	INSTR_TIME_ACCUM_DIFF(dst->io_wait_time,
+						  add->io_wait_time, sub->io_wait_time);
 }
 
 /* helper functions for WAL usage accumulation */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9d335b8507..028ca14aa8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -258,6 +258,7 @@ static int	pgStatXactCommit = 0;
 static int	pgStatXactRollback = 0;
 PgStat_Counter pgStatBlockReadTime = 0;
 PgStat_Counter pgStatBlockWriteTime = 0;
+PgStat_Counter pgStatIOWaitTime = 0;
 static PgStat_Counter pgStatActiveTime = 0;
 static PgStat_Counter pgStatTransactionIdleTime = 0;
 SessionEndType pgStatSessionEndCause = DISCONNECT_NORMAL;
@@ -1004,10 +1005,12 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
 			tsmsg->m_xact_rollback = pgStatXactRollback;
 			tsmsg->m_block_read_time = pgStatBlockReadTime;
 			tsmsg->m_block_write_time = pgStatBlockWriteTime;
+			tsmsg->m_io_wait_time = pgStatIOWaitTime;
 			pgStatXactCommit = 0;
 			pgStatXactRollback = 0;
 			pgStatBlockReadTime = 0;
 			pgStatBlockWriteTime = 0;
+			pgStatIOWaitTime = 0;
 		}
 		else
 		{
@@ -1015,6 +1018,7 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
 			tsmsg->m_xact_rollback = 0;
 			tsmsg->m_block_read_time = 0;
 			tsmsg->m_block_write_time = 0;
+			tsmsg->m_io_wait_time = 0;
 		}
 
 		n = tsmsg->m_nentries;
@@ -5122,6 +5126,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 	dbentry->last_checksum_failure = 0;
 	dbentry->n_block_read_time = 0;
 	dbentry->n_block_write_time = 0;
+	dbentry->n_io_wait_time = 0;
 	dbentry->n_sessions = 0;
 	dbentry->total_session_time = 0;
 	dbentry->total_active_time = 0;
@@ -6442,6 +6447,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 		dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
 		dbentry->n_block_read_time += msg->m_block_read_time;
 		dbentry->n_block_write_time += msg->m_block_write_time;
+		dbentry->n_io_wait_time += msg->m_io_wait_time;
 
 		/*
 		 * Process all table entries in the message.
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 726d50f698..c5aacbcf9d 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -36,6 +36,8 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
+#include "executor/instrument.h"
+#include "storage/bufmgr.h"
 
 #define PGAIO_VERBOSE
 
@@ -1729,6 +1731,12 @@ wait_ref_again:
 		}
 		else if (io_method != IOMETHOD_WORKER && (flags & PGAIOIP_INFLIGHT))
 		{
+			instr_time	io_wait_start, io_wait_time;
+			if (track_io_timing)
+				INSTR_TIME_SET_CURRENT(io_wait_start);
+			else
+				INSTR_TIME_SET_ZERO(io_wait_start);
+
 			/* note that this is allowed to spuriously return */
 			if (io_method == IOMETHOD_WORKER)
 				ConditionVariableSleep(&io->cv, WAIT_EVENT_AIO_IO_COMPLETE_ONE);
@@ -1741,6 +1749,14 @@ wait_ref_again:
 			else if (io_method == IOMETHOD_POSIX)
 				pgaio_posix_wait_one(io, ref_generation);
 #endif
+
+			if (track_io_timing)
+			{
+				INSTR_TIME_SET_CURRENT(io_wait_time);
+				INSTR_TIME_SUBTRACT(io_wait_time, io_wait_start);
+				pgstat_count_io_wait_time(INSTR_TIME_GET_MICROSEC(io_wait_time));
+				INSTR_TIME_ADD(pgBufferUsage.io_wait_time, io_wait_time);
+			}
 		}
 #ifdef USE_POSIX_AIO
 		/* XXX untangle this */
diff --git a/src/backend/storage/aio/aio_util.c b/src/backend/storage/aio/aio_util.c
index 35436dfcd3..a79db4d747 100644
--- a/src/backend/storage/aio/aio_util.c
+++ b/src/backend/storage/aio/aio_util.c
@@ -417,8 +417,6 @@ pg_streaming_read_alloc(uint32 iodepth, uintptr_t pgsr_private,
 {
 	PgStreamingRead *pgsr;
 
-	iodepth = Max(Min(iodepth, NBuffers / 128), 1);
-
 	pgsr = palloc0(offsetof(PgStreamingRead, all_items) +
 				   sizeof(PgStreamingReadItem) * iodepth * 2);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5df1b78f2..69175699b0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -145,6 +145,8 @@ bool		track_io_timing = false;
  */
 int			effective_io_concurrency = 0;
 
+int			io_bitmap_prefetch_depth = 128;
+
 /*
  * Like effective_io_concurrency, but used by maintenance code paths that might
  * benefit from a higher setting because they work on behalf of many sessions.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2fca05f7af..a6d4f121e1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1615,6 +1615,8 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
 	PG_RETURN_FLOAT8(result);
 }
 
+// TODO: add one for io wait time
+
 Datum
 pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d143783f22..f4f02fb9ad 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3029,6 +3029,16 @@ static struct config_int ConfigureNamesInt[] =
 		0, MAX_IO_CONCURRENCY,
 		check_maintenance_io_concurrency, NULL, NULL
 	},
 
+	{
+		{"io_bitmap_prefetch_depth", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Maximum pre-fetch distance for bitmapheapscan"),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&io_bitmap_prefetch_depth,
+		128, 0, MAX_IO_CONCURRENCY,
+		NULL, NULL, NULL
+	},
 	{
 		{"io_wal_concurrency", PGC_POSTMASTER, RESOURCES_ASYNCHRONOUS,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 62ca398e9d..950414cc0b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -30,6 +30,7 @@ typedef struct BufferUsage
 	long		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
+	instr_time	io_wait_time;
 } BufferUsage;
 
 typedef struct WalUsage
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b77dcbc58b..a02813f02f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -291,6 +291,7 @@ typedef struct PgStat_MsgTabstat
 	int			m_xact_rollback;
 	PgStat_Counter m_block_read_time;	/* times in microseconds */
 	PgStat_Counter m_block_write_time;
+	PgStat_Counter m_io_wait_time;
 	PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
 } PgStat_MsgTabstat;
 
@@ -732,6 +733,7 @@ typedef struct PgStat_StatDBEntry
 	TimestampTz last_checksum_failure;
 	PgStat_Counter n_block_read_time;	/* times in microseconds */
 	PgStat_Counter n_block_write_time;
+	PgStat_Counter n_io_wait_time;
 	PgStat_Counter n_sessions;
 	PgStat_Counter total_session_time;
 	PgStat_Counter total_active_time;
@@ -1417,6 +1419,8 @@ extern PgStat_MsgWal WalStats;
 extern PgStat_Counter pgStatBlockReadTime;
 extern PgStat_Counter pgStatBlockWriteTime;
 
+extern PgStat_Counter pgStatIOWaitTime;
+
 /*
 * Updated by the traffic cop and in errfinish()
 */
@@ -1593,6 +1597,8 @@ pgstat_report_wait_end(void)
 	(pgStatBlockReadTime += (n))
 #define pgstat_count_buffer_write_time(n) \
 	(pgStatBlockWriteTime += (n))
+#define pgstat_count_io_wait_time(n) \
+	(pgStatIOWaitTime += (n))
 
 extern void pgstat_count_heap_insert(Relation rel, PgStat_Counter n);
 extern void pgstat_count_heap_update(Relation rel, bool hot);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07401f8493..8c21bb6e56 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -70,6 +70,7 @@ extern int	bgwriter_lru_maxpages;
 extern double bgwriter_lru_multiplier;
 extern bool track_io_timing;
 extern int	effective_io_concurrency;
+extern int	io_bitmap_prefetch_depth;
 extern int	maintenance_io_concurrency;
 
 extern int	checkpoint_flush_after;
-- 
2.27.0
From 4ef0eaf4aed17becd94d085c8092b8b79d7bca93 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Tue, 22 Jun 2021 16:14:58 -0400
Subject: [PATCH v2 1/2] Use pgsr for AIO bitmapheapscan

---
 src/backend/access/gin/ginget.c           |  18 +-
 src/backend/access/gin/ginscan.c          |   4 +
 src/backend/access/heap/heapam_handler.c  | 190 +++++++-
 src/backend/executor/nodeBitmapHeapscan.c | 505 ++--------------------
 src/backend/nodes/tidbitmap.c             |  55 ++-
 src/include/access/gin_private.h          |   5 +
 src/include/access/heapam.h               |   2 +
 src/include/access/tableam.h              |   4 +-
 src/include/executor/nodeBitmapHeapscan.h |   1 +
 src/include/nodes/execnodes.h             |  24 +-
 src/include/nodes/tidbitmap.h             |   7 +-
 src/include/storage/aio.h                 |   2 +-
 12 files changed, 279 insertions(+), 538 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 03191e016c..ef7c284cd0 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -311,6 +311,8 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
 		}
 	}
 
+#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
+
 /*
  * Start* functions setup beginning state of searches: finds correct buffer and pins it.
  */
@@ -332,6 +334,7 @@ restartScanEntry:
 	entry->nlist = 0;
 	entry->matchBitmap = NULL;
 	entry->matchResult = NULL;
+	entry->savedMatchResult = NULL;
 	entry->reduceResult = false;
 	entry->predictNumberResult = 0;
 
@@ -372,7 +375,10 @@ restartScanEntry:
 		if (entry->matchBitmap)
 		{
 			if (entry->matchIterator)
+			{
 				tbm_end_iterate(entry->matchIterator);
+				pfree(entry->savedMatchResult);
+			}
 			entry->matchIterator = NULL;
 			tbm_free(entry->matchBitmap);
 			entry->matchBitmap = NULL;
@@ -385,6 +391,8 @@ restartScanEntry:
 		if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 		{
 			entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+			entry->savedMatchResult = (TBMIterateResult *) palloc0(sizeof(TBMIterateResult) +
+					MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
 			entry->isFinished = false;
 		}
 	}
@@ -790,6 +798,7 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry,
 #define gin_rand() (((double) random()) / ((double) MAX_RANDOM_VALUE))
 #define dropItem(e) ( gin_rand() > ((double)GinFuzzySearchLimit)/((double)((e)->predictNumberResult)) )
 
+
 /*
  * Sets entry->curItem to next heap item pointer > advancePast, for one entry
  * of one scan key, or sets entry->isFinished to true if there are no more.
@@ -817,7 +826,6 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		/* A bitmap result */
 		BlockNumber advancePastBlk = GinItemPointerGetBlockNumber(&advancePast);
 		OffsetNumber advancePastOff = GinItemPointerGetOffsetNumber(&advancePast);
-
 		for (;;)
 		{
 			/*
@@ -831,12 +839,18 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 				(ItemPointerIsLossyPage(&advancePast) &&
 				 entry->matchResult->blockno == advancePastBlk))
 			{
-				entry->matchResult = tbm_iterate(entry->matchIterator);
+
+				tbm_iterate(entry->matchIterator, entry->savedMatchResult);
+				if (!BlockNumberIsValid(entry->savedMatchResult->blockno))
+					entry->matchResult = NULL;
+				else
+					entry->matchResult = entry->savedMatchResult;
 
 				if (entry->matchResult == NULL)
 				{
 					ItemPointerSetInvalid(&entry->curItem);
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->savedMatchResult);
 					entry->matchIterator = NULL;
 					entry->isFinished = true;
 					break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index 55e2d49fd7..3fd9310887 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -107,6 +107,7 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
 	scanEntry->matchBitmap = NULL;
 	scanEntry->matchIterator = NULL;
 	scanEntry->matchResult = NULL;
+	scanEntry->savedMatchResult = NULL;
 	scanEntry->list = NULL;
 	scanEntry->nlist = 0;
 	scanEntry->offset = InvalidOffsetNumber;
@@ -246,7 +247,10 @@ ginFreeScanKeys(GinScanOpaque so)
 			if (entry->list)
 				pfree(entry->list);
 			if (entry->matchIterator)
+			{
 				tbm_end_iterate(entry->matchIterator);
+				pfree(entry->savedMatchResult);
+			}
 			if (entry->matchBitmap)
 				tbm_free(entry->matchBitmap);
 		}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9c65741c41..a8bd8050dc 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
 #include "access/syncscan.h"
 #include "access/tableam.h"
 #include "access/tsmapi.h"
+#include "access/visibilitymap.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/index.h"
@@ -36,6 +37,7 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/aio.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
@@ -57,6 +59,11 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
 
 static const TableAmRoutine heapam_methods;
 
+static PgStreamingReadNextStatus
+bitmapheapscan_pgsr_next_single(uintptr_t pgsr_private, PgAioInProgress *aio, uintptr_t *read_private);
+static void
+bitmapheapscan_pgsr_release(uintptr_t pgsr_private, uintptr_t read_private);
+
 
 /* ------------------------------------------------------------------------
  * Slot related callbacks for heap AM
@@ -2106,13 +2113,128 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
  * Executor related callbacks for the heap AM
  * ------------------------------------------------------------------------
  */
+#define MAX_TUPLES_PER_PAGE MaxHeapTuplesPerPage
+
+// TODO: for heap, these are in heapam.c instead of heapam_handler.c
+// but, heap may move where it does the setup of pgsr
+static void
+bitmapheapscan_pgsr_release(uintptr_t pgsr_private, uintptr_t read_private)
+{
+	BitmapHeapScanState *bhs_state = (BitmapHeapScanState *) pgsr_private;
+	HeapScanDesc hdesc = (HeapScanDesc ) bhs_state->ss.ss_currentScanDesc;
+	TBMIterateResult *tbmres = (TBMIterateResult *) read_private;
+
+	ereport(DEBUG3,
+			errmsg("pgsr %s: releasing buf %d",
+				   NameStr(hdesc->rs_base.rs_rd->rd_rel->relname),
+				   tbmres->buffer),
+			errhidestmt(true),
+			errhidecontext(true));
+
+	Assert(BufferIsValid(tbmres->buffer));
+	ReleaseBuffer(tbmres->buffer);
+}
+
+static PgStreamingReadNextStatus
+bitmapheapscan_pgsr_next_single(uintptr_t pgsr_private, PgAioInProgress *aio, uintptr_t *read_private)
+{
+	bool already_valid;
+	bool skip_fetch;
+	BitmapHeapScanState *bhs_state = (BitmapHeapScanState *) pgsr_private;
+	/*
+	 * TODO: instead of passing the BitmapHeapScanState node when setting up
+	 * and ultimately using it here as pgsr_private, perhaps I can pass only the
+	 * iterator by adding a pointer to the HeapScanDesc to the iterator and
+	 * moving the vmbuffer into the heapscandesc and also add can_skip_fetch to
+	 * the iterator and then pass the iterator as the private state.
+	 * If doing this, will need a separate bitmapheapscan_pgsr_next_parallel() in
+	 * addition to the bitmapheapscan_pgsr_next_single() which would use the
+	 * shared_tbmiterator instead of the tbmiterator() (and then would need separate
+	 * alloc functions for setup and potentially different release functions).
+	 */
+	ParallelBitmapHeapState *pstate = bhs_state->pstate;
+	HeapScanDesc hdesc = (HeapScanDesc ) bhs_state->ss.ss_currentScanDesc;
+	TBMIterateResult *tbmres = (TBMIterateResult *) palloc0(sizeof(TBMIterateResult) +
+			MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+
+	Assert(bhs_state->initialized);
+
+	if (pstate == NULL)
+		tbm_iterate(bhs_state->tbmiterator, tbmres);
+	else
+		tbm_shared_iterate(bhs_state->shared_tbmiterator, tbmres);
+
+	// TODO: could this be invalid for another reason than hit_end?
+	if (!BlockNumberIsValid(tbmres->blockno))
+	{
+		pfree(tbmres);
+		tbmres = NULL;
+		*read_private = 0;
+		return PGSR_NEXT_END;
+	}
+
+	/*
+	 * Ignore any claimed entries past what we think is the end of the
+	 * relation. It may have been extended after the start of our scan (we
+	 * only hold an AccessShareLock, and it could be inserts from this
+	 * backend).
+	 */
+	if (tbmres->blockno >= hdesc->rs_nblocks)
+	{
+		tbmres->blockno = InvalidBlockNumber;
+		*read_private = (uintptr_t) tbmres;
+		return PGSR_NEXT_NO_IO;
+	}
+
+	/*
+	 * We can skip fetching the heap page if we don't need any fields
+	 * from the heap, and the bitmap entries don't need rechecking,
+	 * and all tuples on the page are visible to our transaction.
+	 */
+	skip_fetch = (bhs_state->can_skip_fetch && !tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno,
+				&bhs_state->vmbuffer));
+
+	if (skip_fetch)
+	{
+		/*
+		 * The number of tuples on this page is put into
+		 * node->return_empty_tuples.
+		 */
+		tbmres->buffer = InvalidBuffer;
+		*read_private = (uintptr_t) tbmres;
+		return PGSR_NEXT_NO_IO;
+	}
+
+	tbmres->buffer = ReadBufferAsync(hdesc->rs_base.rs_rd,
+			MAIN_FORKNUM,
+			tbmres->blockno,
+			RBM_NORMAL, hdesc->rs_strategy, &already_valid,
+			&aio);
+
+	*read_private = (uintptr_t) tbmres;
+
+	if (already_valid)
+		return PGSR_NEXT_NO_IO;
+	else
+		return PGSR_NEXT_IO;
+}
+
+// TODO: put this in the right place
+void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate)
+{
+	HeapScanDesc hscan = (HeapScanDesc ) scanstate->ss.ss_currentScanDesc;
+	if (!hscan->rs_inited)
+	{
+		int iodepth = Max(Min(128, NBuffers / 128), 1);
+		hscan->pgsr = pg_streaming_read_alloc(iodepth, (uintptr_t) scanstate,
+				bitmapheapscan_pgsr_next_single,
+				bitmapheapscan_pgsr_release);
+
+		hscan->rs_inited = true;
+	}
+}
 
 static bool
 heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+							  TBMIterateResult **tbmres)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber page = tbmres->blockno;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
@@ -2120,22 +2242,35 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	hscan->rs_cindex = 0;
 	hscan->rs_ntuples = 0;
 
-	/*
-	 * Ignore any claimed entries past what we think is the end of the
-	 * relation. It may have been extended after the start of our scan (we
-	 * only hold an AccessShareLock, and it could be inserts from this
-	 * backend).
-	 */
-	if (page >= hscan->rs_nblocks)
+	Assert(hscan->pgsr);
+
+	if (*tbmres)
+	{
+		if (BufferIsValid((*tbmres)->buffer))
+			ReleaseBuffer((*tbmres)->buffer);
+		hscan->rs_cbuf = InvalidBuffer;
+		pfree(*tbmres);
+	}
+
+	*tbmres = (TBMIterateResult *) pg_streaming_read_get_next(hscan->pgsr);
+
+	/* hit the end */
+	if (*tbmres == NULL)
+		return true;
+
+	/* Invalid due to past the end of the relation */
+	if (!BlockNumberIsValid((*tbmres)->blockno))
+	{
+		pfree(*tbmres);
+		*tbmres = NULL;
 		return false;
+	}
+
+	hscan->rs_cblock = (*tbmres)->blockno;
+	hscan->rs_cbuf = (*tbmres)->buffer;
+
+	/* Skipped fetching, we'll still use ntuples though */
+	if (!(BufferIsValid(hscan->rs_cbuf)))
+		return true;
 
-	/*
-	 * Acquire pin on the target heap page, trading in any pin we held before.
-	 */
-	hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
-										  scan->rs_rd,
-										  page);
-	hscan->rs_cblock = page;
 	buffer = hscan->rs_cbuf;
 	snapshot = scan->rs_snapshot;
 
@@ -2156,7 +2291,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	/*
	 * We need two separate strategies for lossy and non-lossy cases.
	 */
-	if (tbmres->ntuples >= 0)
+	if ((*tbmres)->ntuples >= 0)
 	{
 		/*
 		 * Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2165,13 +2300,13 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 		 */
 		int			curslot;
 
-		for (curslot = 0; curslot < tbmres->ntuples; curslot++)
+		for (curslot = 0; curslot < (*tbmres)->ntuples; curslot++)
 		{
-			OffsetNumber offnum = tbmres->offsets[curslot];
+			OffsetNumber offnum = (*tbmres)->offsets[curslot];
 			ItemPointerData tid;
 			HeapTupleData heapTuple;
 
-			ItemPointerSet(&tid, page, offnum);
+			ItemPointerSet(&tid, (*tbmres)->blockno, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2199,7 +2334,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
 			loctup.t_len = ItemIdGetLength(lp);
 			loctup.t_tableOid = scan->rs_rd->rd_id;
-			ItemPointerSet(&loctup.t_self, page, offnum);
+			ItemPointerSet(&loctup.t_self, (*tbmres)->blockno, offnum);
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
@@ -2230,6 +2365,19 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
 	Page		dp;
 	ItemId		lp;
 
+	/* we skipped fetching */
+	if (BufferIsInvalid(tbmres->buffer))
+	{
+		Assert(tbmres->ntuples >= 0);
+		if (tbmres->ntuples > 0)
+		{
+			ExecStoreAllNullTuple(slot);
+			tbmres->ntuples--;
+			return true;
+		}
+		Assert(tbmres->ntuples == 0);
+		return false;
+	}
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
 	 */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3861bd8a24..3ef25a8f36 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -36,6 +36,8 @@
 #include "postgres.h"
 
 #include <math.h>
+// TODO: delete me after moving scan setup function
+#include "access/heapam.h"
 
 #include "access/relscan.h"
 #include "access/tableam.h"
@@ -55,13 +57,13 @@
 
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-												TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
-								  TableScanDesc scan);
 static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
 
+// TODO: add to tableam
+void table_bitmap_scan_setup(BitmapHeapScanState *scanstate)
+{
+	bitmapheap_pgsr_alloc(scanstate);
+}
 
 /* ----------------------------------------------------------------
  *		BitmapHeapNext
@@ -75,10 +77,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	TableScanDesc scan;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator = NULL;
-	TBMSharedIterator *shared_tbmiterator = NULL;
-	TBMIterateResult *tbmres;
-	TupleTableSlot *slot;
+	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
@@ -89,23 +88,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
-	if (pstate == NULL)
-		tbmiterator = node->tbmiterator;
-	else
-		shared_tbmiterator = node->shared_tbmiterator;
-	tbmres = node->tbmres;
 
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
-	 *
-	 * For prefetching, we use *two* iterators, one for the pages we are
-	 * actually scanning and another that runs ahead of the first for
-	 * prefetching.  node->prefetch_pages tracks exactly how many pages ahead
-	 * the prefetch iterator is.  Also, node->prefetch_target tracks the
-	 * desired prefetch distance, which starts small and increases up to the
-	 * node->prefetch_maximum.  This is to avoid doing a lot of prefetching in
-	 * a scan that stops after a few tuples because of a LIMIT.
 	 */
 	if (!node->initialized)
 	{
@@ -117,17 +103,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				elog(ERROR, "unrecognized result from subplan");
 
 			node->tbm = tbm;
-			node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->prefetch_iterator = tbm_begin_iterate(tbm);
-				node->prefetch_pages = 0;
-				node->prefetch_target = -1;
-			}
-#endif							/* USE_PREFETCH */
+			node->tbmiterator = tbm_begin_iterate(tbm);
 		}
 		else
 		{
@@ -143,180 +119,50 @@ BitmapHeapNext(BitmapHeapScanState *node)
 					elog(ERROR, "unrecognized result from subplan");
 
 				node->tbm = tbm;
-
 				/*
 				 * Prepare to iterate over the TBM. This will return the
 				 * dsa_pointer of the iterator state which will be used by
 				 * multiple processes to iterate jointly.
 				 */
-				pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
-				if (node->prefetch_maximum > 0)
-				{
-					pstate->prefetch_iterator =
-						tbm_prepare_shared_iterate(tbm);
-
-					/*
-					 * We don't need the mutex here as we haven't yet woke up
-					 * others.
-					 */
-					pstate->prefetch_pages = 0;
-					pstate->prefetch_target = -1;
-				}
-#endif
+				pstate->tbmiterator =
+					tbm_prepare_shared_iterate(tbm);
 
 				/* We have initialized the shared state so wake up others. */
 				BitmapDoneInitializingSharedState(pstate);
 			}
 
 			/* Allocate a private iterator and attach the shared state to it */
-			node->shared_tbmiterator = shared_tbmiterator =
+			node->shared_tbmiterator =
 				tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->shared_prefetch_iterator =
-					tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
-			}
-#endif							/* USE_PREFETCH */
 		}
 
 		node->initialized = true;
+
+		/* do any required setup, such as setting up streaming read helper */
+		// TODO: modify for parallel as relevant
+		table_bitmap_scan_setup(node);
+
+		/* get the first block */
+		while (!table_scan_bitmap_next_block(scan, &node->tbmres));
+
+		if (node->tbmres == NULL)
+			return NULL;
+
+		if (node->tbmres->ntuples >= 0)
+			node->exact_pages++;
+		else
+			node->lossy_pages++;
 	}
 
+	// TODO: seems like it would be more clear to have an independent function
+	// getting the next tuple and block and then only have the recheck here.
+	// the loop condition would be next_tuple != NULL
 	for (;;)
 	{
-		bool		skip_fetch;
-
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * Get next page of results if needed
-		 */
-		if (tbmres == NULL)
-		{
-			if (!pstate)
-				node->tbmres = tbmres = tbm_iterate(tbmiterator);
-			else
-				node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
-			if (tbmres == NULL)
-			{
-				/* no more entries in the bitmap */
-				break;
-			}
-
-			BitmapAdjustPrefetchIterator(node, tbmres);
-
-			/*
-			 * We can skip fetching the heap page if we don't need any fields
-			 * from the heap, and the bitmap entries don't need rechecking,
-			 * and all tuples on the page are visible to our transaction.
-			 *
-			 * XXX: It's a layering violation that we do these checks above
-			 * tableam, they should probably moved below it at some point.
-			 */
-			skip_fetch = (node->can_skip_fetch &&
-						  !tbmres->recheck &&
-						  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-										 tbmres->blockno,
-										 &node->vmbuffer));
-
-			if (skip_fetch)
-			{
-				/* can't be lossy in the skip_fetch case */
-				Assert(tbmres->ntuples >= 0);
-
-				/*
-				 * The number of tuples on this page is put into
-				 * node->return_empty_tuples.
-				 */
-				node->return_empty_tuples = tbmres->ntuples;
-			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
-			{
-				/* AM doesn't think this block is valid, skip */
-				continue;
-			}
-
-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
-			/* Adjust the prefetch target */
-			BitmapAdjustPrefetchTarget(node);
-		}
-		else
-		{
-			/*
-			 * Continuing in previously obtained page.
-			 */
-
-#ifdef USE_PREFETCH
-
-			/*
-			 * Try to prefetch at least a few pages even before we get to the
-			 * second page if we don't stop reading after the first tuple.
-			 */
-			if (!pstate)
-			{
-				if (node->prefetch_target < node->prefetch_maximum)
-					node->prefetch_target++;
-			}
-			else if (pstate->prefetch_target < node->prefetch_maximum)
-			{
-				/* take spinlock while updating shared state */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_target < node->prefetch_maximum)
-					pstate->prefetch_target++;
-				SpinLockRelease(&pstate->mutex);
-			}
-#endif							/* USE_PREFETCH */
-		}
-
-		/*
-		 * We issue prefetch requests *after* fetching the current page to try
-		 * to avoid having prefetching interfere with the main I/O. Also, this
-		 * should happen only when we have determined there is still something
-		 * to do on the current page, else we may uselessly prefetch the same
-		 * page we are just about to request for real.
-		 *
-		 * XXX: It's a layering violation that we do these checks above
-		 * tableam, they should probably moved below it at some point.
-		 */
-		BitmapPrefetch(node, scan);
-
-		if (node->return_empty_tuples > 0)
+		/* Attempt to fetch tuple from AM. */
+		if (table_scan_bitmap_next_tuple(scan, node->tbmres, slot))
 		{
-			/*
-			 * If we don't have to fetch the tuple, just return nulls.
-			 */
-			ExecStoreAllNullTuple(slot);
-
-			if (--node->return_empty_tuples == 0)
-			{
-				/* no more tuples to return in the next round */
-				node->tbmres = tbmres = NULL;
-			}
-		}
-		else
-		{
-			/*
-			 * Attempt to fetch tuple from AM.
-			 */
-			if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
-			{
-				/* nothing more to look at on this page */
-				node->tbmres = tbmres = NULL;
-				continue;
-			}
-
-			/*
-			 * If we are using lossy info, we have to recheck the qual
-			 * conditions at every tuple.
-			 */
-			if (tbmres->recheck)
+			// TODO: couldn't we have recheck set to true when it was only because
+			// the bitmap was lossy and not because the qual needs to be rechecked?
+			if (node->tbmres->recheck)
 			{
 				econtext->ecxt_scantuple = slot;
 				if (!ExecQualAndReset(node->bitmapqualorig, econtext))
@@ -327,16 +173,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
 					continue;
 				}
 			}
+
+			return slot;
 		}
 
-		/* OK to return this tuple */
-		return slot;
-	}
+		/*
+		 * Get next page of results
+		 */
+		while (!table_scan_bitmap_next_block(scan, &node->tbmres));
 
-	/*
-	 * if we get here it means we are at the end of the scan..
-	 */
-	return ExecClearTuple(slot);
+		/* if we get here it means we are at the end of the scan */
+		if (node->tbmres == NULL)
+			return NULL;
+
+		if (node->tbmres->ntuples >= 0)
+			node->exact_pages++;
+		else
+			node->lossy_pages++;
+	}
 }
 
 /*
@@ -354,235 +207,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
 	ConditionVariableBroadcast(&pstate->cv);
 }
 
-/*
- * BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-							 TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (node->prefetch_pages > 0)
-		{
-			/* The main iterator has closed the distance by one page */
-			node->prefetch_pages--;
-		}
-		else if (prefetch_iterator)
-		{
-			/* Do not let the prefetch iterator get behind the main one */
-			TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
-			if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
-				elog(ERROR, "prefetch and main iterators are out of sync");
-		}
-		return;
-	}
-
-	if (node->prefetch_maximum > 0)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_pages > 0)
-		{
-			pstate->prefetch_pages--;
-			SpinLockRelease(&pstate->mutex);
-		}
-		else
-		{
-			/* Release the mutex before iterating */
-			SpinLockRelease(&pstate->mutex);
-
-			/*
-			 * In case of shared mode, we can not ensure that the current
-			 * blockno of the main iterator and that of the prefetch iterator
-			 * are same.  It's possible that whatever blockno we are
-			 * prefetching will be processed by another process.  Therefore,
-			 * we don't validate the blockno here as we do in non-parallel
-			 * case.
-			 */
-			if (prefetch_iterator)
-				tbm_shared_iterate(prefetch_iterator);
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max.  Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		if (node->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (node->prefetch_target >= node->prefetch_maximum / 2)
-			node->prefetch_target = node->prefetch_maximum;
-		else if (node->prefetch_target > 0)
-			node->prefetch_target *= 2;
-		else
-			node->prefetch_target++;
-		return;
-	}
-
-	/* Do an unlocked check first to save spinlock acquisitions. */
-	if (pstate->prefetch_target < node->prefetch_maximum)
-	{
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
-			pstate->prefetch_target = node->prefetch_maximum;
-		else if (pstate->prefetch_target > 0)
-			pstate->prefetch_target *= 2;
-		else
-			pstate->prefetch_target++;
-		SpinLockRelease(&pstate->mutex);
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-	/*
-	 * FIXME: This really should just all be replaced by using one iterator
-	 * and a PgStreamingRead. tbm_iterate() actually does a fair bit of work,
-	 * we don't want to repeat that. Nor is it good to do the buffer mapping
-	 * lookups twice.
-	 */
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-	bool		issued_prefetch = false;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (node->prefetch_pages < node->prefetch_target)
-			{
-				TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-				bool		skip_fetch;
-
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_iterate(prefetch_iterator);
-					node->prefetch_iterator = NULL;
-					break;
-				}
-				node->prefetch_pages++;
-
-				/*
-				 * If we expect not to have to actually read this heap page,
-				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
-				 *
-				 * This depends on the assumption that the index AM will
-				 * report the same recheck flag for this future heap page as
-				 * it did for the current heap page; which is not a certainty
-				 * but is true in many cases.
-				 */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-				{
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-					issued_prefetch = true;
-				}
-			}
-		}
-
-		return;
-	}
-
-	if (pstate->prefetch_pages < pstate->prefetch_target)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (1)
-			{
-				TBMIterateResult *tbmpre;
-				bool		do_prefetch = false;
-				bool		skip_fetch;
-
-				/*
-				 * Recheck under the mutex. If some other process has already
-				 * done enough prefetching then we need not to do anything.
-				 */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_pages < pstate->prefetch_target)
-				{
-					pstate->prefetch_pages++;
-					do_prefetch = true;
-				}
-				SpinLockRelease(&pstate->mutex);
-
-				if (!do_prefetch)
-					return;
-
-				tbmpre = tbm_shared_iterate(prefetch_iterator);
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_shared_iterate(prefetch_iterator);
-					node->shared_prefetch_iterator = NULL;
-					break;
-				}
-
-				/* As above, skip prefetch if we expect not to need page */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-				{
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-					issued_prefetch = true;
-				}
-			}
-		}
-	}
-
-	/*
-	 * The PrefetchBuffer() calls staged IOs, but didn't necessarily submit
-	 * them, as it is more efficient to amortize the syscall cost across
-	 * multiple calls.
-	 */
-	if (issued_prefetch)
-		pgaio_submit_pending(true);
-#endif							/* USE_PREFETCH */
-}
 
 /*
  * BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
@@ -631,27 +255,18 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	/* release bitmaps and buffers if any */
 	if (node->tbmiterator)
 		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->shared_tbmiterator)
 		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
 	if (node->vmbuffer != InvalidBuffer)
 		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 	node->tbm = NULL;
 	node->tbmiterator = NULL;
 	node->tbmres = NULL;
-	node->prefetch_iterator = NULL;
 	node->initialized = false;
 	node->shared_tbmiterator = NULL;
-	node->shared_prefetch_iterator = NULL;
 	node->vmbuffer = InvalidBuffer;
-	node->pvmbuffer = InvalidBuffer;
 
 	ExecScanReScan(&node->ss);
 
@@ -699,18 +314,12 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 	 */
 	if (node->tbmiterator)
 		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
 	if (node->shared_tbmiterator)
 		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->vmbuffer != InvalidBuffer)
 		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 
 	/*
 	 * close heap scan
@@ -750,18 +359,12 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->tbm = NULL;
 	scanstate->tbmiterator = NULL;
 	scanstate->tbmres = NULL;
-	scanstate->return_empty_tuples = 0;
 	scanstate->vmbuffer = InvalidBuffer;
-	scanstate->pvmbuffer = InvalidBuffer;
 	scanstate->exact_pages = 0;
 	scanstate->lossy_pages = 0;
-	scanstate->prefetch_iterator = NULL;
-	scanstate->prefetch_pages = 0;
-	scanstate->prefetch_target = 0;
 	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
 	scanstate->shared_tbmiterator = NULL;
-	scanstate->shared_prefetch_iterator = NULL;
 	scanstate->pstate = NULL;
 
 	/*
@@ -812,13 +415,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->bitmapqualorig =
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
-	/*
-	 * Maximum number of prefetches for the tablespace if configured,
-	 * otherwise the current value of the effective_io_concurrency GUC.
-	 */
-	scanstate->prefetch_maximum =
-		get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
 	scanstate->ss.ss_currentRelation = currentRelation;
 
 	scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
@@ -909,12 +505,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
 
 	pstate->tbmiterator = 0;
-	pstate->prefetch_iterator = 0;
 
 	/* Initialize the mutex */
 	SpinLockInit(&pstate->mutex);
-	pstate->prefetch_pages = 0;
-	pstate->prefetch_target = 0;
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
@@ -946,11 +539,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 	if (DsaPointerIsValid(pstate->tbmiterator))
 		tbm_free_shared_area(dsa, pstate->tbmiterator);
 
-	if (DsaPointerIsValid(pstate->prefetch_iterator))
-		tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
 	pstate->tbmiterator = InvalidDsaPointer;
-	pstate->prefetch_iterator = InvalidDsaPointer;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index c5feacbff4..9ac0fa98d0 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -180,7 +180,6 @@ struct TBMIterator
 	int			spageptr;		/* next spages index */
 	int			schunkptr;		/* next schunks index */
 	int			schunkbit;		/* next bit to check in current schunk */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /*
@@ -221,7 +220,6 @@ struct TBMSharedIterator
 	PTEntryArray *ptbase;		/* pagetable element array */
 	PTIterationArray *ptpages;	/* sorted exact page index list */
 	PTIterationArray *ptchunks; /* sorted lossy page index list */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /* Local function prototypes */
@@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
 	 * Create the TBMIterator struct, with enough trailing space to serve the
 	 * needs of the TBMIterateResult sub-struct.
 	 */
-	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
-									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
 	iterator->tbm = tbm;
 
 	/*
@@ -966,11 +963,10 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
 * be examined, but the condition must be rechecked anyway.  (For ease of
 * testing, recheck is always set true when ntuples < 0.)
 */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
 {
 	TIDBitmap  *tbm = iterator->tbm;
-	TBMIterateResult *output = &(iterator->output);
 
 	Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
 
@@ -1008,11 +1004,11 @@ tbm_iterate(TBMIterator *iterator)
 			chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			iterator->schunkbit++;
-			return output;
+			return;
 		}
 	}
 
@@ -1028,16 +1024,16 @@ tbm_iterate(TBMIterator *iterator)
 		page = tbm->spages[iterator->spageptr];
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		iterator->spageptr++;
-		return output;
+		return;
 	}
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
 }
 
 /*
@@ -1047,10 +1043,9 @@ tbm_iterate(TBMIterator *iterator)
 * across multiple processes.  We need to acquire the iterator LWLock,
 * before accessing the shared members.
 */
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
 {
-	TBMIterateResult *output = &iterator->output;
 	TBMSharedIteratorState *istate = iterator->state;
 	PagetableEntry *ptbase = NULL;
 	int		   *idxpages = NULL;
@@ -1101,13 +1096,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 			chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			istate->schunkbit++;
 
 			LWLockRelease(&istate->lock);
-			return output;
+			return;
 		}
 	}
 
@@ -1117,21 +1112,21 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 		int			ntuples;
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		istate->spageptr++;
 
 		LWLockRelease(&istate->lock);
 
-		return output;
+		return;
 	}
 
 	LWLockRelease(&istate->lock);
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
 }
 
 /*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 670a40b4be..1122c098c7 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -352,6 +352,11 @@ typedef struct GinScanEntryData
 	TIDBitmap  *matchBitmap;
 	TBMIterator *matchIterator;
 	TBMIterateResult *matchResult;
+	// TODO: a temporary hack to deal with the fact that I am
+	// 1) not sure if InvalidBlockNumber can come up for other reasons than exhausting the bitmap
+	// and 2) not having taken the time yet to check all the places where matchResult == NULL
+	// is used to make sure I can replace it with something else
+	TBMIterateResult *savedMatchResult;
 
 	/* used for Posting list and one page in Posting tree */
 	ItemPointerData *list;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 331f5c6716..c4d653e923 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
 #include "access/skey.h"
 #include "access/table.h"		/* for backward compatibility */
 #include "access/tableam.h"
+#include "nodes/execnodes.h"
 #include "nodes/lockoptions.h"
 #include "nodes/primnodes.h"
 #include "storage/bufpage.h"
@@ -225,5 +226,6 @@ extern bool ResolveCminCmaxDuringDecoding(struct HTAB *tuplecid_data,
 										  CommandId *cmin, CommandId *cmax);
 extern void HeapCheckForSerializableConflictOut(bool valid, Relation relation, HeapTuple tuple,
 												Buffer buffer, Snapshot snapshot);
+extern void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate);
 
 #endif							/* HEAPAM_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 414b6b4d57..fea54384ec 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -786,7 +786,7 @@ typedef struct TableAmRoutine
 	 * scan_bitmap_next_tuple need to exist, or neither.
 	 */
 	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+										   struct TBMIterateResult **tbmres);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1929,7 +1929,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
 */
 static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+							 struct TBMIterateResult **tbmres)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 3b0bd5acb8..64d8c6a07c 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..fc69ea55d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1532,11 +1532,7 @@ typedef enum
 /* ----------------
 *	 ParallelBitmapHeapState information
 *		tbmiterator				iterator for scanning current pages
- *		prefetch_iterator		iterator for prefetching ahead of current page
- *		mutex					mutual exclusion for the prefetching variable
- *								and state
- *		prefetch_pages			# pages prefetch iterator is ahead of current
- *		prefetch_target			current target prefetch distance
+ *		mutex					mutual exclusion for the state
 *		state					current state of the TIDBitmap
 *		cv						conditional wait variable
 *		phs_snapshot_data		snapshot data shared to workers
@@ -1545,10 +1541,7 @@ typedef enum
 typedef struct ParallelBitmapHeapState
 {
 	dsa_pointer tbmiterator;
-	dsa_pointer prefetch_iterator;
 	slock_t		mutex;
-	int			prefetch_pages;
-	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1559,22 +1552,16 @@ typedef struct ParallelBitmapHeapState
 *
 *		bitmapqualorig	   execution state for bitmapqualorig expressions
 *		tbm				   bitmap obtained from child index scan(s)
- *		tbmiterator		   iterator for scanning current pages
+ *		tbmiterator		   iterator for scanning pages
 *		tbmres			   current-page data
 *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
 *		return_empty_tuples number of empty tuples to return
 *		vmbuffer		   buffer for visibility-map lookups
- *		pvmbuffer		   ditto, for prefetched pages
 *		exact_pages		   total number of exact pages retrieved
 *		lossy_pages		   total number of lossy pages retrieved
- *		prefetch_iterator  iterator for prefetching ahead of current page
- *		prefetch_pages	   # pages prefetch iterator is ahead of current
- *		prefetch_target    current target prefetch distance
- *		prefetch_maximum   maximum value for prefetch_target
 *		pscan_len		   size of the shared memory for parallel bitmap
 *		initialized		   is node is ready to iterate
 *		shared_tbmiterator	   shared iterator
- *		shared_prefetch_iterator shared iterator for prefetching
 *		pstate			   shared state for parallel bitmap scan
 * ----------------
 */
@@ -1586,19 +1573,12 @@ typedef struct BitmapHeapScanState
 	TBMIterator *tbmiterator;
 	TBMIterateResult *tbmres;
 	bool		can_skip_fetch;
-	int			return_empty_tuples;
 	Buffer		vmbuffer;
-	Buffer		pvmbuffer;
 	long		exact_pages;
 	long		lossy_pages;
-	TBMIterator *prefetch_iterator;
-	int			prefetch_pages;
-	int			prefetch_target;
-	int			prefetch_maximum;
 	Size		pscan_len;
 	bool		initialized;
 	TBMSharedIterator *shared_tbmiterator;
-	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
 } BitmapHeapScanState;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index bc67166105..236de80f23 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -23,6 +23,8 @@
 #define TIDBITMAP_H
 
 #include "storage/itemptr.h"
+// TODO: not great that I am including this now
+#include "storage/buf.h"
 #include "utils/dsa.h"
 
 
@@ -40,6 +42,7 @@ typedef struct TBMSharedIterator TBMSharedIterator;
 typedef struct TBMIterateResult
 {
 	BlockNumber blockno;		/* page number containing tuples */
+	Buffer		buffer;
 	int			ntuples;		/* -1 indicates lossy result */
 	bool		recheck;		/* should the tuples be rechecked? */
 	/* Note: recheck is always true if ntuples < 0 */
@@ -64,8 +67,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
 
 extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
 extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
 extern void tbm_end_iterate(TBMIterator *iterator);
 extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 9a07f06b9f..8e1aa48827 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -39,7 +39,7 @@ typedef enum IoMethod
 } IoMethod;
 
 /* We'll default to bgworker. */
-#define DEFAULT_IO_METHOD IOMETHOD_WORKER
+#define DEFAULT_IO_METHOD IOMETHOD_IO_URING
 
 /* GUCs */
 extern int	io_method;
-- 
2.27.0