On Wed, Jul 28, 2021 at 1:37 PM Melanie Plageman
<melanieplage...@gmail.com> wrote:
>
> On Tue, Feb 23, 2021 at 5:04 AM Andres Freund <and...@anarazel.de> wrote:
> >
> > ## AIO API overview
> >
> > The main steps to use AIO (without higher level helpers) are:
> >
> > 1) acquire an "unused" AIO: pgaio_io_get()
> >
> > 2) start some IO, this is done by functions like
> >    pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
> >
> >    The (read|write|fsync|flush_range) indicates the operation, whereas
> >    (smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
> >
> >    (see below for more details about this design choice - it might or
> >    might not be right)
> >
> > 3) optionally: assign a backend-local completion callback to the IO
> >    (pgaio_io_on_completion_local())
> >
> > 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be
> >    put on a per-backend list of pending IOs. The pending IOs can be
> >    explicitly flushed with pgaio_submit_pending(), but will also be
> >    submitted if the pending list grows too large, or if the current
> >    backend waits for the IO.
> >
> >    There are two main reasons not to submit the IO immediately:
> >    - If adjacent, we can merge several IOs into one "kernel level" IO during
> >      submission. Larger IOs are considerably more efficient.
> >    - Several AIO APIs allow submitting a batch of IOs in one system call.
> >
> > 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
> >    backend. When other backends may need to wait for an IO to finish,
> >    pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
> >    BufferDesc), which can be waited for using pgaio_io_wait_ref().
> >
> > 6) Process the results of the request. If a callback was registered in 3),
> >    this isn't always necessary. The results of AIO can be accessed using
> >    pgaio_io_result() which returns an integer where negative numbers are
> >    -errno, and positive numbers are the [partial] success conditions
> >    (e.g. potentially indicating a short read).
> >
> > 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
> >    another operation (pgaio_io_recycle())
> >
> >
> > Most places that want to use AIO shouldn't themselves need to care about
> > managing the number of writes in flight, or the readahead distance. To
> > help with that there are two helper utilities, a "streaming read" and a
> > "streaming write".
> >
> > The "streaming read" helper uses a callback to determine which blocks to
> > prefetch - that allows to do readahead in a sequential fashion but 
> > importantly
> > also allows to asynchronously "read ahead" non-sequential blocks.
> >
> > E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility 
> > map
> > to figure out which block needs to be read next. Similarly 
> > lazy_vacuum_heap()
> > uses the tids in LVDeadTuples to figure out which blocks are going to be
> > needed. Here's the latter as an example:
> > https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
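To make the callback contract concrete: the helper hands the callback an
AIO handle, and the callback stages the next read or reports that there is
nothing left. Here's a pared-down sketch, simplified from
bitmapheapscan_pgsr_next_single() in the attached patch (skip-fetch and
past-end-of-relation handling omitted):

  static PgStreamingReadNextStatus
  next_block_cb(uintptr_t pgsr_private, PgAioInProgress *aio,
                uintptr_t *read_private)
  {
      BitmapHeapScanState *node = (BitmapHeapScanState *) pgsr_private;
      HeapScanDesc hdesc = (HeapScanDesc) node->ss.ss_currentScanDesc;
      bool already_valid;
      TBMIterateResult *tbmres =
          palloc0(sizeof(TBMIterateResult) +
                  MaxHeapTuplesPerPage * sizeof(OffsetNumber));

      /* ask the bitmap iterator for the next block to read */
      tbm_iterate(node->tbmiterator, tbmres);
      if (!BlockNumberIsValid(tbmres->blockno))
      {
          /* bitmap exhausted */
          pfree(tbmres);
          *read_private = 0;
          return PGSR_NEXT_END;
      }

      /* stage an asynchronous read of that block */
      tbmres->buffer = ReadBufferAsync(hdesc->rs_base.rs_rd, MAIN_FORKNUM,
                                       tbmres->blockno, RBM_NORMAL,
                                       hdesc->rs_strategy, &already_valid,
                                       &aio);
      *read_private = (uintptr_t) tbmres;
      return already_valid ? PGSR_NEXT_NO_IO : PGSR_NEXT_IO;
  }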
> >
>
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.
>
...
>
> Oh, and I haven't done testing to see how effective the prefetching is
> -- that is a larger project that I have yet to tackle.
>

I have now done some testing of how effective it is.

I've also updated the original patch to count the first page (in the
lossy/exact page counts mentioned down-thread) as well as to remove
unused prefetch fields and comments.
I've also included a second patch which adds IO wait time information to
EXPLAIN output when used like:
  EXPLAIN (buffers, analyze) SELECT ...

The same commit also introduces a temporary dev GUC,
io_bitmap_prefetch_depth, which I am using to experiment with the
prefetch window size.

I wanted to share some results from changing the prefetch window to
demonstrate how prefetching is working.

The short version of my results is that the prefetching works:

- with the prefetch window set to 1, the IO wait time is 1550 ms
- with the prefetch window set to 128, the IO wait time is 0.18 ms

DDL and repro details below:

On Andres' AIO branch [1], with my bitmap heapscan prefetching patch set
applied, built with the following flags:
-O2 -fno-omit-frame-pointer --with-liburing

And these non-default PostgreSQL settings:
  io_data_direct=1
  io_data_force_async=off
  io_method=io_uring
  log_min_duration_statement=0
  log_duration=on
  set track_io_timing to on;

  set max_parallel_workers_per_gather to 0;
  set enable_seqscan to off;
  set enable_indexscan to off;
  set enable_bitmapscan to on;

  set effective_io_concurrency to 128;
  set io_bitmap_prefetch_depth to 128;

Using this DDL:

drop table if exists bar;
create table bar(a int, b text, c text, d int);
create index bar_idx on bar(a);
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,1000) i;
insert into bar select i%3, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,1000) i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,200) i;
insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,10000000) i;
insert into bar select i%2000, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,10000000) i;
insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,10000000) i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,10000000) i;
insert into bar select i%100, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,10000000) i;
insert into bar select i, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,2000) i;
insert into bar select i%10, md5(i::text), 'abcdefghijklmnopqrstuvwxyz',
  i from generate_series(1,2000) i;
analyze;

And this query (the huge OFFSET discards all rows, so the timing is
dominated by the scan itself):

select * from bar where a > 100 offset 10000000000000;

With the prefetch window set to 1, the query execution time is 5496.129 ms
and the IO wait time is 1550.915 ms.

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1315.845 wait=1550.915
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1315.845 wait=1550.915
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1315.845
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.399
 Planning Time: 4.378 ms
 Execution Time: 5473.404 ms
(18 rows)

With the prefetch window set to 128, the query execution time is 3222 ms
and the IO wait time is 0.178 ms.

mplageman=# explain (buffers, analyze, timing off) select * from bar where a > 100 offset 10000000000000;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1462959.87..1462959.87 rows=1 width=68) (actual rows=0 loops=1)
   Buffers: shared hit=1 read=280571
   I/O Timings: read=1339.795 wait=0.178
   ->  Bitmap Heap Scan on bar  (cost=240521.25..1462959.87 rows=19270298 width=68) (actual rows=19497800 loops=1)
         Recheck Cond: (a > 100)
         Rows Removed by Index Recheck: 400281
         Heap Blocks: exact=47915 lossy=197741
         Buffers: shared hit=1 read=280571
         I/O Timings: read=1339.795 wait=0.178
         ->  Bitmap Index Scan on bar_idx  (cost=0.00..235703.67 rows=19270298 width=0) (actual rows=19497800 loops=1)
               Index Cond: (a > 100)
               Buffers: shared hit=1 read=34915
               I/O Timings: read=1339.795
 Planning:
   Buffers: shared hit=96 read=30
   I/O Timings: read=3.488
 Planning Time: 4.279 ms
 Execution Time: 3434.522 ms
(18 rows)

- Melanie

[1] https://github.com/anarazel/postgres/tree/aio
From 752fd6cc636eea08c06b25d8898e091835442387 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 5 Aug 2021 15:47:50 -0400
Subject: [PATCH v2 2/2] Add IO wait time stat and add guc for BHS prefetch

- Add an IO wait time measurement which can be seen in EXPLAIN output
  (with explain buffers option and track_io_timing on)
- TODO: add the wait time to database statistics
- Also, add a GUC to control the BHS pre-fetch max window size
---
 src/backend/access/heap/heapam_handler.c |  2 +-
 src/backend/commands/explain.c           |  9 ++++++++-
 src/backend/executor/instrument.c        |  4 ++++
 src/backend/postmaster/pgstat.c          |  6 ++++++
 src/backend/storage/aio/aio.c            | 16 ++++++++++++++++
 src/backend/storage/aio/aio_util.c       |  2 --
 src/backend/storage/buffer/bufmgr.c      |  2 ++
 src/backend/utils/adt/pgstatfuncs.c      |  2 ++
 src/backend/utils/misc/guc.c             | 10 ++++++++++
 src/include/executor/instrument.h        |  1 +
 src/include/pgstat.h                     |  6 ++++++
 src/include/storage/bufmgr.h             |  1 +
 12 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8bd8050dc..7409dd2fb3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2221,7 +2221,7 @@ void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate)
 	HeapScanDesc hscan = (HeapScanDesc ) scanstate->ss.ss_currentScanDesc;
 	if (!hscan->rs_inited)
 	{
-		int iodepth = Max(Min(128, NBuffers / 128), 1);
+		int iodepth = io_bitmap_prefetch_depth;
 		hscan->pgsr = pg_streaming_read_alloc(iodepth, (uintptr_t) scanstate,
 		                                      bitmapheapscan_pgsr_next_single,
 		                                      bitmapheapscan_pgsr_release);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c66d39a5c7..d1e89f1e1b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3339,7 +3339,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		bool		has_temp = (usage->temp_blks_read > 0 ||
 								usage->temp_blks_written > 0);
 		bool		has_timing = (!INSTR_TIME_IS_ZERO(usage->blk_read_time) ||
-								  !INSTR_TIME_IS_ZERO(usage->blk_write_time));
+								  !INSTR_TIME_IS_ZERO(usage->blk_write_time) ||
+								  !INSTR_TIME_IS_ZERO(usage->io_wait_time));
 		bool		show_planning = (planning && (has_shared ||
 												  has_local || has_temp || has_timing));
 
@@ -3416,6 +3417,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
 				appendStringInfo(es->str, " write=%0.3f",
 								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time));
+			if (!INSTR_TIME_IS_ZERO(usage->io_wait_time))
+				appendStringInfo(es->str, " wait=%0.3f",
+								 INSTR_TIME_GET_MILLISEC(usage->io_wait_time));
 			appendStringInfoChar(es->str, '\n');
 		}
 
@@ -3452,6 +3456,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			ExplainPropertyFloat("I/O Write Time", "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
 								 3, es);
+			ExplainPropertyFloat("I/O Wait Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->io_wait_time),
+								 5, es);
 		}
 	}
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 96af2a2673..85e7321d63 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -218,6 +218,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->temp_blks_written += add->temp_blks_written;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+	INSTR_TIME_ADD(dst->io_wait_time, add->io_wait_time);
+
 }
 
 /* dst += add - sub */
@@ -240,6 +242,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 						  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
+	INSTR_TIME_ACCUM_DIFF(dst->io_wait_time,
+						  add->io_wait_time, sub->io_wait_time);
 }
 
 /* helper functions for WAL usage accumulation */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9d335b8507..028ca14aa8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -258,6 +258,7 @@ static int	pgStatXactCommit = 0;
 static int	pgStatXactRollback = 0;
 PgStat_Counter pgStatBlockReadTime = 0;
 PgStat_Counter pgStatBlockWriteTime = 0;
+PgStat_Counter pgStatIOWaitTime = 0;
 static PgStat_Counter pgStatActiveTime = 0;
 static PgStat_Counter pgStatTransactionIdleTime = 0;
 SessionEndType pgStatSessionEndCause = DISCONNECT_NORMAL;
@@ -1004,10 +1005,12 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
 		tsmsg->m_xact_rollback = pgStatXactRollback;
 		tsmsg->m_block_read_time = pgStatBlockReadTime;
 		tsmsg->m_block_write_time = pgStatBlockWriteTime;
+		tsmsg->m_io_wait_time = pgStatIOWaitTime;
 		pgStatXactCommit = 0;
 		pgStatXactRollback = 0;
 		pgStatBlockReadTime = 0;
 		pgStatBlockWriteTime = 0;
+		pgStatIOWaitTime = 0;
 	}
 	else
 	{
@@ -1015,6 +1018,7 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
 		tsmsg->m_xact_rollback = 0;
 		tsmsg->m_block_read_time = 0;
 		tsmsg->m_block_write_time = 0;
+		tsmsg->m_io_wait_time = 0;
 	}
 
 	n = tsmsg->m_nentries;
@@ -5122,6 +5126,7 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 	dbentry->last_checksum_failure = 0;
 	dbentry->n_block_read_time = 0;
 	dbentry->n_block_write_time = 0;
+	dbentry->n_io_wait_time = 0;
 	dbentry->n_sessions = 0;
 	dbentry->total_session_time = 0;
 	dbentry->total_active_time = 0;
@@ -6442,6 +6447,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 	dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
 	dbentry->n_block_read_time += msg->m_block_read_time;
 	dbentry->n_block_write_time += msg->m_block_write_time;
+	dbentry->n_io_wait_time += msg->m_io_wait_time;
 
 	/*
 	 * Process all table entries in the message.
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 726d50f698..c5aacbcf9d 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -36,6 +36,8 @@
 #include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
+#include "executor/instrument.h"
+#include "storage/bufmgr.h"
 
 
 #define PGAIO_VERBOSE
@@ -1729,6 +1731,12 @@ wait_ref_again:
 		}
 		else if (io_method != IOMETHOD_WORKER && (flags & PGAIOIP_INFLIGHT))
 		{
+			instr_time	io_wait_start, io_wait_time;
+			if (track_io_timing)
+				INSTR_TIME_SET_CURRENT(io_wait_start);
+			else
+				INSTR_TIME_SET_ZERO(io_wait_start);
+
 			/* note that this is allowed to spuriously return */
 			if (io_method == IOMETHOD_WORKER)
 				ConditionVariableSleep(&io->cv, WAIT_EVENT_AIO_IO_COMPLETE_ONE);
@@ -1741,6 +1749,14 @@ wait_ref_again:
 			else if (io_method == IOMETHOD_POSIX)
 				pgaio_posix_wait_one(io, ref_generation);
 #endif
+
+			if (track_io_timing)
+			{
+				INSTR_TIME_SET_CURRENT(io_wait_time);
+				INSTR_TIME_SUBTRACT(io_wait_time, io_wait_start);
+				pgstat_count_io_wait_time(INSTR_TIME_GET_MICROSEC(io_wait_time));
+				INSTR_TIME_ADD(pgBufferUsage.io_wait_time, io_wait_time);
+			}
 		}
 #ifdef USE_POSIX_AIO
 		/* XXX untangle this */
diff --git a/src/backend/storage/aio/aio_util.c b/src/backend/storage/aio/aio_util.c
index 35436dfcd3..a79db4d747 100644
--- a/src/backend/storage/aio/aio_util.c
+++ b/src/backend/storage/aio/aio_util.c
@@ -417,8 +417,6 @@ pg_streaming_read_alloc(uint32 iodepth, uintptr_t pgsr_private,
 {
 	PgStreamingRead *pgsr;
 
-	iodepth = Max(Min(iodepth, NBuffers / 128), 1);
-
 	pgsr = palloc0(offsetof(PgStreamingRead, all_items) +
 				   sizeof(PgStreamingReadItem) * iodepth * 2);
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5df1b78f2..69175699b0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -145,6 +145,8 @@ bool		track_io_timing = false;
  */
 int			effective_io_concurrency = 0;
 
+int  io_bitmap_prefetch_depth = 128;
+
 /*
  * Like effective_io_concurrency, but used by maintenance code paths that might
  * benefit from a higher setting because they work on behalf of many sessions.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2fca05f7af..a6d4f121e1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1615,6 +1615,8 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
 	PG_RETURN_FLOAT8(result);
 }
 
+// TODO: add one for io wait time
+
 Datum
 pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d143783f22..f4f02fb9ad 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3029,6 +3029,16 @@ static struct config_int ConfigureNamesInt[] =
 		0, MAX_IO_CONCURRENCY,
 		check_maintenance_io_concurrency, NULL, NULL
 	},
+	{
+		{"io_bitmap_prefetch_depth", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Maximum pre-fetch distance for bitmapheapscan"),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&io_bitmap_prefetch_depth,
+		128, 0, MAX_IO_CONCURRENCY,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"io_wal_concurrency", PGC_POSTMASTER, RESOURCES_ASYNCHRONOUS,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 62ca398e9d..950414cc0b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -30,6 +30,7 @@ typedef struct BufferUsage
 	long		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
+	instr_time io_wait_time;
 } BufferUsage;
 
 typedef struct WalUsage
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b77dcbc58b..a02813f02f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -291,6 +291,7 @@ typedef struct PgStat_MsgTabstat
 	int			m_xact_rollback;
 	PgStat_Counter m_block_read_time;	/* times in microseconds */
 	PgStat_Counter m_block_write_time;
+	PgStat_Counter m_io_wait_time;
 	PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
 } PgStat_MsgTabstat;
 
@@ -732,6 +733,7 @@ typedef struct PgStat_StatDBEntry
 	TimestampTz last_checksum_failure;
 	PgStat_Counter n_block_read_time;	/* times in microseconds */
 	PgStat_Counter n_block_write_time;
+	PgStat_Counter n_io_wait_time;
 	PgStat_Counter n_sessions;
 	PgStat_Counter total_session_time;
 	PgStat_Counter total_active_time;
@@ -1417,6 +1419,8 @@ extern PgStat_MsgWal WalStats;
 extern PgStat_Counter pgStatBlockReadTime;
 extern PgStat_Counter pgStatBlockWriteTime;
 
+extern PgStat_Counter pgStatIOWaitTime;
+
 /*
  * Updated by the traffic cop and in errfinish()
  */
@@ -1593,6 +1597,8 @@ pgstat_report_wait_end(void)
 	(pgStatBlockReadTime += (n))
 #define pgstat_count_buffer_write_time(n)							\
 	(pgStatBlockWriteTime += (n))
+#define pgstat_count_io_wait_time(n)							\
+	(pgStatIOWaitTime  += (n))
 
 extern void pgstat_count_heap_insert(Relation rel, PgStat_Counter n);
 extern void pgstat_count_heap_update(Relation rel, bool hot);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07401f8493..8c21bb6e56 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -70,6 +70,7 @@ extern int	bgwriter_lru_maxpages;
 extern double bgwriter_lru_multiplier;
 extern bool track_io_timing;
 extern int	effective_io_concurrency;
+extern int io_bitmap_prefetch_depth;
 extern int	maintenance_io_concurrency;
 
 extern int	checkpoint_flush_after;
-- 
2.27.0

From 4ef0eaf4aed17becd94d085c8092b8b79d7bca93 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Tue, 22 Jun 2021 16:14:58 -0400
Subject: [PATCH v2 1/2] Use pgsr for AIO bitmapheapscan

---
 src/backend/access/gin/ginget.c           |  18 +-
 src/backend/access/gin/ginscan.c          |   4 +
 src/backend/access/heap/heapam_handler.c  | 190 +++++++-
 src/backend/executor/nodeBitmapHeapscan.c | 505 ++--------------------
 src/backend/nodes/tidbitmap.c             |  55 ++-
 src/include/access/gin_private.h          |   5 +
 src/include/access/heapam.h               |   2 +
 src/include/access/tableam.h              |   4 +-
 src/include/executor/nodeBitmapHeapscan.h |   1 +
 src/include/nodes/execnodes.h             |  24 +-
 src/include/nodes/tidbitmap.h             |   7 +-
 src/include/storage/aio.h                 |   2 +-
 12 files changed, 279 insertions(+), 538 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 03191e016c..ef7c284cd0 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -311,6 +311,8 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
 	}
 }
 
+#define MAX_TUPLES_PER_PAGE  MaxHeapTuplesPerPage
+
 /*
  * Start* functions setup beginning state of searches: finds correct buffer and pins it.
  */
@@ -332,6 +334,7 @@ restartScanEntry:
 	entry->nlist = 0;
 	entry->matchBitmap = NULL;
 	entry->matchResult = NULL;
+	entry->savedMatchResult = NULL;
 	entry->reduceResult = false;
 	entry->predictNumberResult = 0;
 
@@ -372,7 +375,10 @@ restartScanEntry:
 			if (entry->matchBitmap)
 			{
 				if (entry->matchIterator)
+				{
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->savedMatchResult);
+				}
 				entry->matchIterator = NULL;
 				tbm_free(entry->matchBitmap);
 				entry->matchBitmap = NULL;
@@ -385,6 +391,8 @@ restartScanEntry:
 		if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 		{
 			entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+			entry->savedMatchResult = (TBMIterateResult *) palloc0(sizeof(TBMIterateResult) +
+				                                                        MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
 			entry->isFinished = false;
 		}
 	}
@@ -790,6 +798,7 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry,
 #define gin_rand() (((double) random()) / ((double) MAX_RANDOM_VALUE))
 #define dropItem(e) ( gin_rand() > ((double)GinFuzzySearchLimit)/((double)((e)->predictNumberResult)) )
 
+
 /*
  * Sets entry->curItem to next heap item pointer > advancePast, for one entry
  * of one scan key, or sets entry->isFinished to true if there are no more.
@@ -817,7 +826,6 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		/* A bitmap result */
 		BlockNumber advancePastBlk = GinItemPointerGetBlockNumber(&advancePast);
 		OffsetNumber advancePastOff = GinItemPointerGetOffsetNumber(&advancePast);
-
 		for (;;)
 		{
 			/*
@@ -831,12 +839,18 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 				   (ItemPointerIsLossyPage(&advancePast) &&
 					entry->matchResult->blockno == advancePastBlk))
 			{
-				entry->matchResult = tbm_iterate(entry->matchIterator);
+
+				tbm_iterate(entry->matchIterator, entry->savedMatchResult);
+				if (!BlockNumberIsValid(entry->savedMatchResult->blockno))
+					entry->matchResult = NULL;
+				else
+					entry->matchResult = entry->savedMatchResult;
 
 				if (entry->matchResult == NULL)
 				{
 					ItemPointerSetInvalid(&entry->curItem);
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->savedMatchResult);
 					entry->matchIterator = NULL;
 					entry->isFinished = true;
 					break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index 55e2d49fd7..3fd9310887 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -107,6 +107,7 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
 	scanEntry->matchBitmap = NULL;
 	scanEntry->matchIterator = NULL;
 	scanEntry->matchResult = NULL;
+	scanEntry->savedMatchResult = NULL;
 	scanEntry->list = NULL;
 	scanEntry->nlist = 0;
 	scanEntry->offset = InvalidOffsetNumber;
@@ -246,7 +247,10 @@ ginFreeScanKeys(GinScanOpaque so)
 		if (entry->list)
 			pfree(entry->list);
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			pfree(entry->savedMatchResult);
+		}
 		if (entry->matchBitmap)
 			tbm_free(entry->matchBitmap);
 	}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9c65741c41..a8bd8050dc 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -27,6 +27,7 @@
 #include "access/syncscan.h"
 #include "access/tableam.h"
 #include "access/tsmapi.h"
+#include "access/visibilitymap.h"
 #include "access/xact.h"
 #include "catalog/catalog.h"
 #include "catalog/index.h"
@@ -36,6 +37,7 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "storage/aio.h"
 #include "storage/bufmgr.h"
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
@@ -57,6 +59,11 @@ static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
 
 static const TableAmRoutine heapam_methods;
 
+static PgStreamingReadNextStatus
+bitmapheapscan_pgsr_next_single(uintptr_t pgsr_private, PgAioInProgress *aio, uintptr_t *read_private);
+static void
+bitmapheapscan_pgsr_release(uintptr_t pgsr_private, uintptr_t read_private);
+
 
 /* ------------------------------------------------------------------------
  * Slot related callbacks for heap AM
@@ -2106,13 +2113,128 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
  * Executor related callbacks for the heap AM
  * ------------------------------------------------------------------------
  */
+#define MAX_TUPLES_PER_PAGE  MaxHeapTuplesPerPage
+
+// TODO: for heap, these are in heapam.c instead of heapam_handler.c
+// but, heap may move where it does the setup of pgsr
+static void
+bitmapheapscan_pgsr_release(uintptr_t pgsr_private, uintptr_t read_private)
+{
+	BitmapHeapScanState *bhs_state = (BitmapHeapScanState *) pgsr_private;
+	HeapScanDesc hdesc = (HeapScanDesc ) bhs_state->ss.ss_currentScanDesc;
+	TBMIterateResult *tbmres = (TBMIterateResult *) read_private;
+
+	ereport(DEBUG3,
+	        errmsg("pgsr %s: releasing buf %d",
+	               NameStr(hdesc->rs_base.rs_rd->rd_rel->relname),
+	               tbmres->buffer),
+	        errhidestmt(true),
+	        errhidecontext(true));
+
+	Assert(BufferIsValid(tbmres->buffer));
+	ReleaseBuffer(tbmres->buffer);
+}
+
+static PgStreamingReadNextStatus
+bitmapheapscan_pgsr_next_single(uintptr_t pgsr_private, PgAioInProgress *aio, uintptr_t *read_private)
+{
+	bool already_valid;
+	bool skip_fetch;
+	BitmapHeapScanState *bhs_state = (BitmapHeapScanState *) pgsr_private;
+	/*
+	 * TODO: instead of passing the BitmapHeapScanState node when setting up
+	 * and ultimately using it here as pgsr_private, perhaps I can pass only the
+	 * iterator by adding a pointer to the HeapScanDesc to the iterator and
+	 * moving the vmbuffer into the heapscandesc and also add can_skip_fetch to
+	 * the iterator and then pass the iterator as the private state.
+	 * If doing this, will need a separate bitmapheapscan_pgsr_next_parallel() in
+	 * addition to the bitmapheapscan_pgsr_next_single() which would use the
+	 * shared_tbmiterator instead of the tbmiterator() (and then would need separate
+	 * alloc functions for setup and potentially different release functions).
+	 */
+	ParallelBitmapHeapState *pstate = bhs_state->pstate;
+	HeapScanDesc hdesc = (HeapScanDesc ) bhs_state->ss.ss_currentScanDesc;
+	TBMIterateResult *tbmres = (TBMIterateResult *) palloc0(sizeof(TBMIterateResult) +
+		                                                                MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	Assert(bhs_state->initialized);
+	if (pstate == NULL)
+		tbm_iterate(bhs_state->tbmiterator, tbmres);
+	else
+		tbm_shared_iterate(bhs_state->shared_tbmiterator, tbmres);
+
+	// TODO: could this be invalid for another reason than hit_end?
+	if (!BlockNumberIsValid(tbmres->blockno))
+	{
+		pfree(tbmres);
+		tbmres = NULL;
+		*read_private = 0;
+		return PGSR_NEXT_END;
+	}
+	/*
+	 * Ignore any claimed entries past what we think is the end of the
+	 * relation. It may have been extended after the start of our scan (we
+	 * only hold an AccessShareLock, and it could be inserts from this
+	 * backend).
+	 */
+	if (tbmres->blockno >= hdesc->rs_nblocks)
+	{
+		tbmres->blockno = InvalidBlockNumber;
+		*read_private = (uintptr_t) tbmres;
+		return PGSR_NEXT_NO_IO;
+	}
+
+	/*
+	 * We can skip fetching the heap page if we don't need any fields
+	 * from the heap, and the bitmap entries don't need rechecking,
+	 * and all tuples on the page are visible to our transaction.
+	 */
+	skip_fetch = (bhs_state->can_skip_fetch && !tbmres->recheck &&
+		VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno,
+		               &bhs_state->vmbuffer));
+
+	if (skip_fetch)
+	{
+		/*
+		 * The number of tuples on this page is put into
+		 * node->return_empty_tuples.
+		 */
+		tbmres->buffer = InvalidBuffer;
+		*read_private = (uintptr_t) tbmres;
+		return PGSR_NEXT_NO_IO;
+	}
+	tbmres->buffer = ReadBufferAsync(hdesc->rs_base.rs_rd,
+	                                      MAIN_FORKNUM,
+	                                      tbmres->blockno,
+	                                      RBM_NORMAL, hdesc->rs_strategy, &already_valid,
+	                                      &aio);
+	*read_private = (uintptr_t) tbmres;
+
+	if (already_valid)
+		return PGSR_NEXT_NO_IO;
+	else
+		return PGSR_NEXT_IO;
+}
+
+// TODO: put this in the right place
+void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate)
+{
+	HeapScanDesc hscan = (HeapScanDesc ) scanstate->ss.ss_currentScanDesc;
+	if (!hscan->rs_inited)
+	{
+		int iodepth = Max(Min(128, NBuffers / 128), 1);
+		hscan->pgsr = pg_streaming_read_alloc(iodepth, (uintptr_t) scanstate,
+		                                      bitmapheapscan_pgsr_next_single,
+		                                      bitmapheapscan_pgsr_release);
+
+		hscan->rs_inited = true;
+	}
+}
 
 static bool
 heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+                              TBMIterateResult **tbmres)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber page = tbmres->blockno;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
@@ -2120,22 +2242,35 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	hscan->rs_cindex = 0;
 	hscan->rs_ntuples = 0;
 
-	/*
-	 * Ignore any claimed entries past what we think is the end of the
-	 * relation. It may have been extended after the start of our scan (we
-	 * only hold an AccessShareLock, and it could be inserts from this
-	 * backend).
-	 */
-	if (page >= hscan->rs_nblocks)
+	Assert(hscan->pgsr);
+	if (*tbmres)
+	{
+		if (BufferIsValid((*tbmres)->buffer))
+			ReleaseBuffer((*tbmres)->buffer);
+		hscan->rs_cbuf = InvalidBuffer;
+		pfree(*tbmres);
+	}
+
+	*tbmres = (TBMIterateResult *) pg_streaming_read_get_next(hscan->pgsr);
+	/* hit the end */
+	if (*tbmres == NULL)
+		return true;
+
+	/* Invalid due to past the end of the relation */
+	if (!BlockNumberIsValid((*tbmres)->blockno))
+	{
+		pfree(*tbmres);
+		*tbmres = NULL;
 		return false;
+	}
+
+	hscan->rs_cblock = (*tbmres)->blockno;
+	hscan->rs_cbuf = (*tbmres)->buffer;
+
+	/* Skipped fetching, we'll still use ntuples though */
+	if (!(BufferIsValid(hscan->rs_cbuf)))
+		return true;
 
-	/*
-	 * Acquire pin on the target heap page, trading in any pin we held before.
-	 */
-	hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
-										  scan->rs_rd,
-										  page);
-	hscan->rs_cblock = page;
 	buffer = hscan->rs_cbuf;
 	snapshot = scan->rs_snapshot;
 
@@ -2156,7 +2291,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	/*
 	 * We need two separate strategies for lossy and non-lossy cases.
 	 */
-	if (tbmres->ntuples >= 0)
+	if ((*tbmres)->ntuples >= 0)
 	{
 		/*
 		 * Bitmap is non-lossy, so we just look through the offsets listed in
@@ -2165,13 +2300,13 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 		 */
 		int			curslot;
 
-		for (curslot = 0; curslot < tbmres->ntuples; curslot++)
+		for (curslot = 0; curslot < (*tbmres)->ntuples; curslot++)
 		{
-			OffsetNumber offnum = tbmres->offsets[curslot];
+			OffsetNumber offnum = (*tbmres)->offsets[curslot];
 			ItemPointerData tid;
 			HeapTupleData heapTuple;
 
-			ItemPointerSet(&tid, page, offnum);
+			ItemPointerSet(&tid, (*tbmres)->blockno, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2199,7 +2334,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
 			loctup.t_len = ItemIdGetLength(lp);
 			loctup.t_tableOid = scan->rs_rd->rd_id;
-			ItemPointerSet(&loctup.t_self, page, offnum);
+			ItemPointerSet(&loctup.t_self, (*tbmres)->blockno, offnum);
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
@@ -2230,6 +2365,19 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
 	Page		dp;
 	ItemId		lp;
 
+	/* we skipped fetching */
+	if (BufferIsInvalid(tbmres->buffer))
+	{
+		Assert(tbmres->ntuples >= 0);
+		if (tbmres->ntuples > 0)
+		{
+			ExecStoreAllNullTuple(slot);
+			tbmres->ntuples--;
+			return true;
+		}
+		Assert(tbmres->ntuples == 0);
+		return false;
+	}
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
 	 */
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 3861bd8a24..3ef25a8f36 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -36,6 +36,8 @@
 #include "postgres.h"
 
 #include <math.h>
+// TODO: delete me after moving scan setup function
+#include "access/heapam.h"
 
 #include "access/relscan.h"
 #include "access/tableam.h"
@@ -55,13 +57,13 @@
 
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-												TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
-								  TableScanDesc scan);
 static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
 
+// TODO: add to tableam
+void table_bitmap_scan_setup(BitmapHeapScanState *scanstate)
+{
+	bitmapheap_pgsr_alloc(scanstate);
+}
 
 /* ----------------------------------------------------------------
  *		BitmapHeapNext
@@ -75,10 +77,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	TableScanDesc scan;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator = NULL;
-	TBMSharedIterator *shared_tbmiterator = NULL;
-	TBMIterateResult *tbmres;
-	TupleTableSlot *slot;
+	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
@@ -89,23 +88,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
-	if (pstate == NULL)
-		tbmiterator = node->tbmiterator;
-	else
-		shared_tbmiterator = node->shared_tbmiterator;
-	tbmres = node->tbmres;
 
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
-	 *
-	 * For prefetching, we use *two* iterators, one for the pages we are
-	 * actually scanning and another that runs ahead of the first for
-	 * prefetching.  node->prefetch_pages tracks exactly how many pages ahead
-	 * the prefetch iterator is.  Also, node->prefetch_target tracks the
-	 * desired prefetch distance, which starts small and increases up to the
-	 * node->prefetch_maximum.  This is to avoid doing a lot of prefetching in
-	 * a scan that stops after a few tuples because of a LIMIT.
 	 */
 	if (!node->initialized)
 	{
@@ -117,17 +103,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				elog(ERROR, "unrecognized result from subplan");
 
 			node->tbm = tbm;
-			node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->prefetch_iterator = tbm_begin_iterate(tbm);
-				node->prefetch_pages = 0;
-				node->prefetch_target = -1;
-			}
-#endif							/* USE_PREFETCH */
+			node->tbmiterator = tbm_begin_iterate(tbm);
 		}
 		else
 		{
@@ -143,180 +119,50 @@ BitmapHeapNext(BitmapHeapScanState *node)
 					elog(ERROR, "unrecognized result from subplan");
 
 				node->tbm = tbm;
-
 				/*
 				 * Prepare to iterate over the TBM. This will return the
 				 * dsa_pointer of the iterator state which will be used by
 				 * multiple processes to iterate jointly.
 				 */
-				pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
-				if (node->prefetch_maximum > 0)
-				{
-					pstate->prefetch_iterator =
-						tbm_prepare_shared_iterate(tbm);
-
-					/*
-					 * We don't need the mutex here as we haven't yet woke up
-					 * others.
-					 */
-					pstate->prefetch_pages = 0;
-					pstate->prefetch_target = -1;
-				}
-#endif
+				pstate->tbmiterator =
+					tbm_prepare_shared_iterate(tbm);
 
 				/* We have initialized the shared state so wake up others. */
 				BitmapDoneInitializingSharedState(pstate);
 			}
 
 			/* Allocate a private iterator and attach the shared state to it */
-			node->shared_tbmiterator = shared_tbmiterator =
+			node->shared_tbmiterator =
 				tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->shared_prefetch_iterator =
-					tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
-			}
-#endif							/* USE_PREFETCH */
 		}
 		node->initialized = true;
+		/* do any required setup, such as setting up streaming read helper */
+		// TODO: modify for parallel as relevant
+		table_bitmap_scan_setup(node);
+		/* get the first block */
+		while (!table_scan_bitmap_next_block(scan, &node->tbmres));
+		if (node->tbmres == NULL)
+			return NULL;
+
+		if (node->tbmres->ntuples >= 0)
+			node->exact_pages++;
+		else
+			node->lossy_pages++;
 	}
 
+
+	// TODO: seems like it would be more clear to have an independent function
+	// getting the next tuple and block and then only have the recheck here.
+	// the loop condition would be next_tuple != NULL
 	for (;;)
 	{
-		bool		skip_fetch;
-
 		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * Get next page of results if needed
-		 */
-		if (tbmres == NULL)
-		{
-			if (!pstate)
-				node->tbmres = tbmres = tbm_iterate(tbmiterator);
-			else
-				node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
-			if (tbmres == NULL)
-			{
-				/* no more entries in the bitmap */
-				break;
-			}
-
-			BitmapAdjustPrefetchIterator(node, tbmres);
-
-			/*
-			 * We can skip fetching the heap page if we don't need any fields
-			 * from the heap, and the bitmap entries don't need rechecking,
-			 * and all tuples on the page are visible to our transaction.
-			 *
-			 * XXX: It's a layering violation that we do these checks above
-			 * tableam, they should probably moved below it at some point.
-			 */
-			skip_fetch = (node->can_skip_fetch &&
-						  !tbmres->recheck &&
-						  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-										 tbmres->blockno,
-										 &node->vmbuffer));
-
-			if (skip_fetch)
-			{
-				/* can't be lossy in the skip_fetch case */
-				Assert(tbmres->ntuples >= 0);
-
-				/*
-				 * The number of tuples on this page is put into
-				 * node->return_empty_tuples.
-				 */
-				node->return_empty_tuples = tbmres->ntuples;
-			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
-			{
-				/* AM doesn't think this block is valid, skip */
-				continue;
-			}
-
-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
-			/* Adjust the prefetch target */
-			BitmapAdjustPrefetchTarget(node);
-		}
-		else
-		{
-			/*
-			 * Continuing in previously obtained page.
-			 */
-
-#ifdef USE_PREFETCH
-
-			/*
-			 * Try to prefetch at least a few pages even before we get to the
-			 * second page if we don't stop reading after the first tuple.
-			 */
-			if (!pstate)
-			{
-				if (node->prefetch_target < node->prefetch_maximum)
-					node->prefetch_target++;
-			}
-			else if (pstate->prefetch_target < node->prefetch_maximum)
-			{
-				/* take spinlock while updating shared state */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_target < node->prefetch_maximum)
-					pstate->prefetch_target++;
-				SpinLockRelease(&pstate->mutex);
-			}
-#endif							/* USE_PREFETCH */
-		}
-
-		/*
-		 * We issue prefetch requests *after* fetching the current page to try
-		 * to avoid having prefetching interfere with the main I/O. Also, this
-		 * should happen only when we have determined there is still something
-		 * to do on the current page, else we may uselessly prefetch the same
-		 * page we are just about to request for real.
-		 *
-		 * XXX: It's a layering violation that we do these checks above
-		 * tableam, they should probably moved below it at some point.
-		 */
-		BitmapPrefetch(node, scan);
-
-		if (node->return_empty_tuples > 0)
+		/* Attempt to fetch tuple from AM. */
+		if (table_scan_bitmap_next_tuple(scan, node->tbmres, slot))
 		{
-			/*
-			 * If we don't have to fetch the tuple, just return nulls.
-			 */
-			ExecStoreAllNullTuple(slot);
-
-			if (--node->return_empty_tuples == 0)
-			{
-				/* no more tuples to return in the next round */
-				node->tbmres = tbmres = NULL;
-			}
-		}
-		else
-		{
-			/*
-			 * Attempt to fetch tuple from AM.
-			 */
-			if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
-			{
-				/* nothing more to look at on this page */
-				node->tbmres = tbmres = NULL;
-				continue;
-			}
-
-			/*
-			 * If we are using lossy info, we have to recheck the qual
-			 * conditions at every tuple.
-			 */
-			if (tbmres->recheck)
+			// TODO: couldn't we have recheck set to true when it was only because
+			// the bitmap was lossy and not because the qual needs to be rechecked?
+			if (node->tbmres->recheck)
 			{
 				econtext->ecxt_scantuple = slot;
 				if (!ExecQualAndReset(node->bitmapqualorig, econtext))
@@ -327,16 +173,23 @@ BitmapHeapNext(BitmapHeapScanState *node)
 					continue;
 				}
 			}
+			return slot;
 		}
 
-		/* OK to return this tuple */
-		return slot;
-	}
+		/*
+		 * Get next page of results
+		 */
+		while (!table_scan_bitmap_next_block(scan, &node->tbmres));
 
-	/*
-	 * if we get here it means we are at the end of the scan..
-	 */
-	return ExecClearTuple(slot);
+		/* if we get here it means we are at the end of the scan */
+		if (node->tbmres == NULL)
+			return NULL;
+
+		if (node->tbmres->ntuples >= 0)
+			node->exact_pages++;
+		else
+			node->lossy_pages++;
+	}
 }
 
 /*
@@ -354,235 +207,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
 	ConditionVariableBroadcast(&pstate->cv);
 }
 
-/*
- *	BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-							 TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (node->prefetch_pages > 0)
-		{
-			/* The main iterator has closed the distance by one page */
-			node->prefetch_pages--;
-		}
-		else if (prefetch_iterator)
-		{
-			/* Do not let the prefetch iterator get behind the main one */
-			TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
-			if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
-				elog(ERROR, "prefetch and main iterators are out of sync");
-		}
-		return;
-	}
-
-	if (node->prefetch_maximum > 0)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_pages > 0)
-		{
-			pstate->prefetch_pages--;
-			SpinLockRelease(&pstate->mutex);
-		}
-		else
-		{
-			/* Release the mutex before iterating */
-			SpinLockRelease(&pstate->mutex);
-
-			/*
-			 * In case of shared mode, we can not ensure that the current
-			 * blockno of the main iterator and that of the prefetch iterator
-			 * are same.  It's possible that whatever blockno we are
-			 * prefetching will be processed by another process.  Therefore,
-			 * we don't validate the blockno here as we do in non-parallel
-			 * case.
-			 */
-			if (prefetch_iterator)
-				tbm_shared_iterate(prefetch_iterator);
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max.  Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		if (node->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (node->prefetch_target >= node->prefetch_maximum / 2)
-			node->prefetch_target = node->prefetch_maximum;
-		else if (node->prefetch_target > 0)
-			node->prefetch_target *= 2;
-		else
-			node->prefetch_target++;
-		return;
-	}
-
-	/* Do an unlocked check first to save spinlock acquisitions. */
-	if (pstate->prefetch_target < node->prefetch_maximum)
-	{
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
-			pstate->prefetch_target = node->prefetch_maximum;
-		else if (pstate->prefetch_target > 0)
-			pstate->prefetch_target *= 2;
-		else
-			pstate->prefetch_target++;
-		SpinLockRelease(&pstate->mutex);
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-	/*
-	 * FIXME: This really should just all be replaced by using one iterator
-	 * and a PgStreamingRead. tbm_iterate() actually does a fair bit of work,
-	 * we don't want to repeat that. Nor is it good to do the buffer mapping
-	 * lookups twice.
-	 */
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-	bool		issued_prefetch = false;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (node->prefetch_pages < node->prefetch_target)
-			{
-				TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-				bool		skip_fetch;
-
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_iterate(prefetch_iterator);
-					node->prefetch_iterator = NULL;
-					break;
-				}
-				node->prefetch_pages++;
-
-				/*
-				 * If we expect not to have to actually read this heap page,
-				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
-				 *
-				 * This depends on the assumption that the index AM will
-				 * report the same recheck flag for this future heap page as
-				 * it did for the current heap page; which is not a certainty
-				 * but is true in many cases.
-				 */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-				{
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-					issued_prefetch = true;
-				}
-			}
-		}
-
-		return;
-	}
-
-	if (pstate->prefetch_pages < pstate->prefetch_target)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (1)
-			{
-				TBMIterateResult *tbmpre;
-				bool		do_prefetch = false;
-				bool		skip_fetch;
-
-				/*
-				 * Recheck under the mutex. If some other process has already
-				 * done enough prefetching then we need not to do anything.
-				 */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_pages < pstate->prefetch_target)
-				{
-					pstate->prefetch_pages++;
-					do_prefetch = true;
-				}
-				SpinLockRelease(&pstate->mutex);
-
-				if (!do_prefetch)
-					return;
-
-				tbmpre = tbm_shared_iterate(prefetch_iterator);
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_shared_iterate(prefetch_iterator);
-					node->shared_prefetch_iterator = NULL;
-					break;
-				}
-
-				/* As above, skip prefetch if we expect not to need page */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-				{
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-					issued_prefetch = true;
-				}
-			}
-		}
-	}
-
-	/*
-	 * The PrefetchBuffer() calls staged IOs, but didn't necessarily submit
-	 * them, as it is more efficient to amortize the syscall cost across
-	 * multiple calls.
-	 */
-	if (issued_prefetch)
-		pgaio_submit_pending(true);
-#endif							/* USE_PREFETCH */
-}
 
 /*
  * BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
@@ -631,27 +255,18 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	/* release bitmaps and buffers if any */
 	if (node->tbmiterator)
 		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->shared_tbmiterator)
 		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
 	if (node->vmbuffer != InvalidBuffer)
 		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 	node->tbm = NULL;
 	node->tbmiterator = NULL;
 	node->tbmres = NULL;
-	node->prefetch_iterator = NULL;
 	node->initialized = false;
 	node->shared_tbmiterator = NULL;
-	node->shared_prefetch_iterator = NULL;
 	node->vmbuffer = InvalidBuffer;
-	node->pvmbuffer = InvalidBuffer;
 
 	ExecScanReScan(&node->ss);
 
@@ -699,18 +314,12 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 	 */
 	if (node->tbmiterator)
 		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
 	if (node->shared_tbmiterator)
 		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->vmbuffer != InvalidBuffer)
 		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 
 	/*
 	 * close heap scan
@@ -750,18 +359,12 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->tbm = NULL;
 	scanstate->tbmiterator = NULL;
 	scanstate->tbmres = NULL;
-	scanstate->return_empty_tuples = 0;
 	scanstate->vmbuffer = InvalidBuffer;
-	scanstate->pvmbuffer = InvalidBuffer;
 	scanstate->exact_pages = 0;
 	scanstate->lossy_pages = 0;
-	scanstate->prefetch_iterator = NULL;
-	scanstate->prefetch_pages = 0;
-	scanstate->prefetch_target = 0;
 	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
 	scanstate->shared_tbmiterator = NULL;
-	scanstate->shared_prefetch_iterator = NULL;
 	scanstate->pstate = NULL;
 
 	/*
@@ -812,13 +415,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->bitmapqualorig =
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
-	/*
-	 * Maximum number of prefetches for the tablespace if configured,
-	 * otherwise the current value of the effective_io_concurrency GUC.
-	 */
-	scanstate->prefetch_maximum =
-		get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
 	scanstate->ss.ss_currentRelation = currentRelation;
 
 	scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
@@ -909,12 +505,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
 
 	pstate->tbmiterator = 0;
-	pstate->prefetch_iterator = 0;
 
 	/* Initialize the mutex */
 	SpinLockInit(&pstate->mutex);
-	pstate->prefetch_pages = 0;
-	pstate->prefetch_target = 0;
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
@@ -946,11 +539,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 	if (DsaPointerIsValid(pstate->tbmiterator))
 		tbm_free_shared_area(dsa, pstate->tbmiterator);
 
-	if (DsaPointerIsValid(pstate->prefetch_iterator))
-		tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
 	pstate->tbmiterator = InvalidDsaPointer;
-	pstate->prefetch_iterator = InvalidDsaPointer;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index c5feacbff4..9ac0fa98d0 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -180,7 +180,6 @@ struct TBMIterator
 	int			spageptr;		/* next spages index */
 	int			schunkptr;		/* next schunks index */
 	int			schunkbit;		/* next bit to check in current schunk */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /*
@@ -221,7 +220,6 @@ struct TBMSharedIterator
 	PTEntryArray *ptbase;		/* pagetable element array */
 	PTIterationArray *ptpages;	/* sorted exact page index list */
 	PTIterationArray *ptchunks; /* sorted lossy page index list */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /* Local function prototypes */
@@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
 	 * Create the TBMIterator struct, with enough trailing space to serve the
 	 * needs of the TBMIterateResult sub-struct.
 	 */
-	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
-									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
 	iterator->tbm = tbm;
 
 	/*
@@ -966,11 +963,10 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
  * be examined, but the condition must be rechecked anyway.  (For ease of
  * testing, recheck is always set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
 {
 	TIDBitmap  *tbm = iterator->tbm;
-	TBMIterateResult *output = &(iterator->output);
 
 	Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
 
@@ -1008,11 +1004,11 @@ tbm_iterate(TBMIterator *iterator)
 			chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			iterator->schunkbit++;
-			return output;
+			return;
 		}
 	}
 
@@ -1028,16 +1024,16 @@ tbm_iterate(TBMIterator *iterator)
 			page = tbm->spages[iterator->spageptr];
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		iterator->spageptr++;
-		return output;
+		return;
 	}
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
 }
 
 /*
@@ -1047,10 +1043,9 @@ tbm_iterate(TBMIterator *iterator)
  *	across multiple processes.  We need to acquire the iterator LWLock,
  *	before accessing the shared members.
  */
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
 {
-	TBMIterateResult *output = &iterator->output;
 	TBMSharedIteratorState *istate = iterator->state;
 	PagetableEntry *ptbase = NULL;
 	int		   *idxpages = NULL;
@@ -1101,13 +1096,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 			chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			istate->schunkbit++;
 
 			LWLockRelease(&istate->lock);
-			return output;
+			return;
 		}
 	}
 
@@ -1117,21 +1112,21 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 		int			ntuples;
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		istate->spageptr++;
 
 		LWLockRelease(&istate->lock);
 
-		return output;
+		return;
 	}
 
 	LWLockRelease(&istate->lock);
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
 }
 
 /*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 670a40b4be..1122c098c7 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -352,6 +352,12 @@ typedef struct GinScanEntryData
 	TIDBitmap  *matchBitmap;
 	TBMIterator *matchIterator;
 	TBMIterateResult *matchResult;
+	/*
+	 * TODO: a temporary hack.  It is not yet clear whether InvalidBlockNumber
+	 * can come up for reasons other than exhausting the bitmap, and the
+	 * places that test matchResult == NULL still need auditing before it can
+	 * be replaced with something else.
+	 */
+	TBMIterateResult *savedMatchResult;
 
 	/* used for Posting list and one page in Posting tree */
 	ItemPointerData *list;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 331f5c6716..c4d653e923 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
 #include "access/skey.h"
 #include "access/table.h"		/* for backward compatibility */
 #include "access/tableam.h"
+#include "nodes/execnodes.h"
 #include "nodes/lockoptions.h"
 #include "nodes/primnodes.h"
 #include "storage/bufpage.h"
@@ -225,5 +226,6 @@ extern bool ResolveCminCmaxDuringDecoding(struct HTAB *tuplecid_data,
 										  CommandId *cmin, CommandId *cmax);
 extern void HeapCheckForSerializableConflictOut(bool valid, Relation relation, HeapTuple tuple,
 												Buffer buffer, Snapshot snapshot);
+extern void bitmapheap_pgsr_alloc(BitmapHeapScanState *scanstate);
 
 #endif							/* HEAPAM_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 414b6b4d57..fea54384ec 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -786,7 +786,7 @@ typedef struct TableAmRoutine
 	 * scan_bitmap_next_tuple need to exist, or neither.
 	 */
 	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+										   struct TBMIterateResult **tbmres);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1929,7 +1929,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
  */
 static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+							 struct TBMIterateResult **tbmres)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 3b0bd5acb8..64d8c6a07c 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e31ad6204e..fc69ea55d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1532,11 +1532,7 @@ typedef enum
 /* ----------------
  *	 ParallelBitmapHeapState information
  *		tbmiterator				iterator for scanning current pages
- *		prefetch_iterator		iterator for prefetching ahead of current page
- *		mutex					mutual exclusion for the prefetching variable
- *								and state
- *		prefetch_pages			# pages prefetch iterator is ahead of current
- *		prefetch_target			current target prefetch distance
+ *		mutex					mutual exclusion for the state
  *		state					current state of the TIDBitmap
  *		cv						conditional wait variable
  *		phs_snapshot_data		snapshot data shared to workers
@@ -1545,10 +1541,7 @@ typedef enum
 typedef struct ParallelBitmapHeapState
 {
 	dsa_pointer tbmiterator;
-	dsa_pointer prefetch_iterator;
 	slock_t		mutex;
-	int			prefetch_pages;
-	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1559,22 +1552,16 @@ typedef struct ParallelBitmapHeapState
  *
  *		bitmapqualorig	   execution state for bitmapqualorig expressions
  *		tbm				   bitmap obtained from child index scan(s)
- *		tbmiterator		   iterator for scanning current pages
+ *		tbmiterator		   iterator for scanning pages
  *		tbmres			   current-page data
  *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
  *		return_empty_tuples number of empty tuples to return
  *		vmbuffer		   buffer for visibility-map lookups
- *		pvmbuffer		   ditto, for prefetched pages
  *		exact_pages		   total number of exact pages retrieved
  *		lossy_pages		   total number of lossy pages retrieved
- *		prefetch_iterator  iterator for prefetching ahead of current page
- *		prefetch_pages	   # pages prefetch iterator is ahead of current
- *		prefetch_target    current target prefetch distance
- *		prefetch_maximum   maximum value for prefetch_target
  *		pscan_len		   size of the shared memory for parallel bitmap
  *		initialized		   is node is ready to iterate
  *		shared_tbmiterator	   shared iterator
- *		shared_prefetch_iterator shared iterator for prefetching
  *		pstate			   shared state for parallel bitmap scan
  * ----------------
  */
@@ -1586,19 +1573,12 @@ typedef struct BitmapHeapScanState
 	TBMIterator *tbmiterator;
 	TBMIterateResult *tbmres;
 	bool		can_skip_fetch;
-	int			return_empty_tuples;
 	Buffer		vmbuffer;
-	Buffer		pvmbuffer;
 	long		exact_pages;
 	long		lossy_pages;
-	TBMIterator *prefetch_iterator;
-	int			prefetch_pages;
-	int			prefetch_target;
-	int			prefetch_maximum;
 	Size		pscan_len;
 	bool		initialized;
 	TBMSharedIterator *shared_tbmiterator;
-	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
 } BitmapHeapScanState;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index bc67166105..236de80f23 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -23,6 +23,8 @@
 #define TIDBITMAP_H
 
 #include "storage/itemptr.h"
+/* TODO: it is not great that tidbitmap.h now needs this include */
+#include "storage/buf.h"
 #include "utils/dsa.h"
 
 
@@ -40,6 +42,7 @@ typedef struct TBMSharedIterator TBMSharedIterator;
 typedef struct TBMIterateResult
 {
 	BlockNumber blockno;		/* page number containing tuples */
+	Buffer		buffer;			/* page's buffer, if already read in */
 	int			ntuples;		/* -1 indicates lossy result */
 	bool		recheck;		/* should the tuples be rechecked? */
 	/* Note: recheck is always true if ntuples < 0 */
@@ -64,8 +67,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
 
 extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
 extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
 extern void tbm_end_iterate(TBMIterator *iterator);
 extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
index 9a07f06b9f..8e1aa48827 100644
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -39,7 +39,7 @@ typedef enum IoMethod
 } IoMethod;
 
-/* We'll default to bgworker. */
-#define DEFAULT_IO_METHOD IOMETHOD_WORKER
+/* We'll default to io_uring. */
+#define DEFAULT_IO_METHOD IOMETHOD_IO_URING
 
 /* GUCs */
 extern int io_method;
-- 
2.27.0
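
A note to make the tbm_iterate() interface change concrete: callers now own
the TBMIterateResult instead of getting a pointer into the iterator, and
exhaustion is signaled by blockno == InvalidBlockNumber rather than a NULL
return.  A minimal sketch of the new calling convention follows -- the palloc
sizing assumes TBMIterateResult still ends in a variable-length offsets array
(mirroring the trailing space tbm_begin_iterate() used to reserve), it uses
MaxHeapTuplesPerPage since tidbitmap.c's MAX_TUPLES_PER_PAGE is private to
that file, and process_page() is just a stand-in for the caller's per-page
work:

    TBMIterateResult *tbmres;

    /* caller-owned result, with room for the per-page offsets array */
    tbmres = (TBMIterateResult *) palloc(sizeof(TBMIterateResult) +
                                         MaxHeapTuplesPerPage * sizeof(OffsetNumber));

    for (;;)
    {
        tbm_iterate(iterator, tbmres);

        /* InvalidBlockNumber now means the bitmap is exhausted */
        if (!BlockNumberIsValid(tbmres->blockno))
            break;

        /* ntuples == -1 still indicates a lossy page (recheck is then true) */
        process_page(tbmres);
    }

    pfree(tbmres);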

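Similarly, since scan_bitmap_next_block() now takes a TBMIterateResult **,
the executor-side loop ends up shaped roughly like the sketch below: the
table AM (fed by the PgStreamingRead helper) hands back a pointer to the
next completed page's result instead of filling in a caller-provided struct.
This is only an outline of the intended contract (scandesc and slot are the
usual scan descriptor and tuple slot), not the exact patch code:

    TBMIterateResult *tbmres = NULL;

    /* the AM returns false once the bitmap is drained */
    while (table_scan_bitmap_next_block(scandesc, &tbmres))
    {
        /*
         * tbmres->buffer was read (and pinned) asynchronously by the
         * streaming read helper, so no synchronous ReadBuffer() is needed.
         */
        while (table_scan_bitmap_next_tuple(scandesc, tbmres, slot))
        {
            /* recheck bitmapqualorig if tbmres->recheck, then emit tuple */
        }
    }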