On 21/12/2025 7:55 PM, Peter Geoghegan wrote:
On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <[email protected]> wrote:
Attached is v4.
Attached is v5. Changes from v4:
* Simplified and optimized index-only scans, with a particular
emphasis on avoiding regressions with nested loop joins with an inner
index-only scan.
There were quite a number of small problems/dead code related to
index-only scans fixed by this new v5. Overall, I'm quite a bit
happier with the state of index-only scans, which I'd not paid too
much attention to before now.
* Added Valgrind instrumentation to the hash index patch, which was
required to fix some false positives.
The generic indexam_util_batch_unlock routine had Valgrind
instrumentation in earlier versions, just to keep nbtree's buffer
locking checks from generating similar false positives. Some time
later, when I added the hashgetbatch patch, there were new Valgrind
false positives during hash index scans -- which I missed at first.
This new v5 revision adds similar Valgrind checks to hash itself
(changes that add code that is more or less a direct port of the stuff
added to nbtree by commit 4a70f829), which fixes the false positives,
and is independently useful.
The rule for amgetbatch-based index AMs is that they must have similar
buffer locking instrumentation. That seems like a good thing.
--
Peter Geoghegan
In the previous mail I shared the results of my experiments with different
prefetch distances.
I think that we should start prefetching heap tuples not from the
second batch, but only after some number of TIDs have been processed.
Attached please find a patch which implements this approach.
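To make the approach concrete before the numbers, here is a minimal
standalone sketch in plain C. The FakeBatchScan struct and
maybe_start_stream() are made up for illustration and stub out the
executor state; only the proceed_items computation, the
read_stream_threshold default of 10 and the "create the stream lazily"
decision follow the attached patch (quoted at the end of this mail).

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for PostgreSQL's ScanDirection */
typedef enum
{
    BackwardScanDirection = -1,
    ForwardScanDirection = 1
} ScanDirection;

/* Made-up stand-in for the batch scan state used by the real patch */
typedef struct
{
    int     firstItem;      /* offset of first TID in the current batch */
    int     lastItem;       /* offset of last TID in the current batch */
    int     item;           /* TID the scan is currently positioned on */
    bool    stream_started; /* has a read stream been created yet? */
} FakeBatchScan;

static int read_stream_threshold = 10;  /* GUC default from the patch */

/*
 * Start prefetching only once at least read_stream_threshold TIDs of the
 * current batch have been processed.  Returns true when the (stubbed)
 * read stream gets created.
 */
static bool
maybe_start_stream(FakeBatchScan *scan, ScanDirection dir)
{
    int     proceed_items;

    proceed_items = (dir == ForwardScanDirection)
        ? scan->item - scan->firstItem
        : scan->lastItem - scan->item;

    if (!scan->stream_started && proceed_items >= read_stream_threshold)
    {
        /* the real patch calls read_stream_begin_relation() here */
        scan->stream_started = true;
        return true;
    }
    return false;
}

int
main(void)
{
    FakeBatchScan scan = {.firstItem = 0, .lastItem = 63, .item = 0,
                          .stream_started = false};

    for (; scan.item <= scan.lastItem; scan.item++)
    {
        if (maybe_start_stream(&scan, ForwardScanDirection))
            printf("read stream started after %d processed items\n",
                   scan.item - scan.firstItem);
    }
    return 0;
}

The point is that a scan which processes fewer than read_stream_threshold
TIDs never pays the cost of setting up a read stream, which is what
appears to hurt the small-LIMIT cases with always-on prefetching.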
And below are updated results:
limit\prefetch      on      off   always      inc  threshold
             1   12074    12765     3146     3282      12394
             2    5912     6198     2463     2438       6124
             4    2919     3047     1334     1964       2910
             8    1554     1496     1166     1409       1588
            16     815      775      947      940        600
            32     424      403      687      695        478
            64     223      208      446      453        358
           128     115      106      258      270        232
           256      68       53      138      149        131
           512      43       27       72       78         71
          1024      28       13       38       40         38
The last column shows the results with prefetching enabled and read_stream_threshold=10.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index b9d42b15a18..aac663edfdd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -36,6 +36,7 @@
#include "commands/progress.h"
#include "executor/executor.h"
#include "miscadmin.h"
+#include "optimizer/cost.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
@@ -46,6 +47,8 @@
#include "utils/builtins.h"
#include "utils/rel.h"
+int read_stream_threshold = DEFAULT_READ_STREAM_THRESHOLD;
+
static void reform_and_rewrite_tuple(HeapTuple tuple,
                                     Relation OldHeap, Relation NewHeap,
                                     Datum *values, bool *isnull, RewriteState rwstate);
@@ -229,6 +232,7 @@ heap_batch_advance_pos(IndexScanDesc scan, struct BatchQueueItemPos *pos,
                       ScanDirection direction)
{
BatchIndexScan batch;
+ int proceed_items;
/* make sure we have batching initialized and consistent */
batch_assert_batches_valid(scan);
@@ -288,6 +292,24 @@ heap_batch_advance_pos(IndexScanDesc scan, struct BatchQueueItemPos *pos,
*/
batch = INDEX_SCAN_BATCH(scan, pos->batch);
+   proceed_items = ScanDirectionIsForward(direction)
+       ? pos->item - batch->firstItem
+       : batch->lastItem - pos->item;
+
+   /* Delay creating the read stream until enough TIDs have been processed */
+   if (proceed_items >= read_stream_threshold
+       && !scan->xs_heapfetch->rs
+       && !scan->batchqueue->disabled
+       && !scan->xs_want_itup      /* XXX prefetching disabled for IoS, for now */
+       && enable_indexscan_prefetch)
+   {
+       scan->xs_heapfetch->rs =
+           read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
+                                      scan->heapRelation, MAIN_FORKNUM,
+                                      scan->heapRelation->rd_tableam->index_getnext_stream,
+                                      scan, 0);
+   }
+
    if (ScanDirectionIsForward(direction))
    {
        if (++pos->item <= batch->lastItem)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 8f6fa6843cb..0c2081e32ba 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2322,6 +2322,15 @@
max => 'DBL_MAX',
},
+{ name => 'read_stream_threshold', type => 'int', context => 'PGC_USERSET', group => 'QUERY_TUNING_COST',
+  short_desc => 'Minimum number of heap tuples to process before creating a read stream.',
+  flags => 'GUC_EXPLAIN',
+  variable => 'read_stream_threshold',
+  boot_val => 'DEFAULT_READ_STREAM_THRESHOLD',
+  min => '0',
+  max => 'INT_MAX',
+},
+
{ name => 'recovery_end_command', type => 'string', context => 'PGC_SIGHUP', group => 'WAL_ARCHIVE_RECOVERY',
  short_desc => 'Sets the shell command that will be executed once at the end of recovery.',
variable => 'recoveryEndCommand',
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 00f4c3d0011..97150433c99 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -28,6 +28,7 @@
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0
+#define DEFAULT_READ_STREAM_THRESHOLD 10
/* defaults for non-Cost parameters */
#define DEFAULT_RECURSIVE_WORKTABLE_FACTOR 10.0
@@ -72,6 +73,7 @@ extern PGDLLIMPORT bool enable_partition_pruning;
extern PGDLLIMPORT bool enable_presorted_aggregate;
extern PGDLLIMPORT bool enable_async_append;
extern PGDLLIMPORT int constraint_exclusion;
+extern PGDLLIMPORT int read_stream_threshold;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
                                  double index_pages, PlannerInfo *root);