On 7/23/25 02:59, Andres Freund wrote:
> Hi,
>
> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote:
>> But I don't see why would this have any effect on the prefetch distance,
>> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve
>> that. I'd have expected exactly the opposite behavior.
>>
>> Could be bug, of course. But it'd be helpful to see the dataset/query.
>
> Pgbench scale 500, with the simpler query from my message.
>
I tried to reproduce this, but I'm not seeing behavior. I'm not sure how
you monitor the queue depth (presumably iostat?), but I added a basic
prefetch info to explain (see the attached WIP patch), reporting the
average prefetch distance, number of stalls (with distance=0) and stream
resets (after filling INDEX_SCAN_MAX_BATCHES).
And I see this (there's a complete explain output attached) for the two
queries from your message [1]. The
simple query:
SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid
LIMIT 10000000);
complex query:
SELECT max(abalance), min(abalance), sum(abalance::numeric),
avg(abalance::numeric), avg(aid::numeric), avg(bid::numeric) FROM
(SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);
The stats actually look *exactly* the same, which makes sense because
it's reading the same index.
max_batches distance stalls resets stalls/reset
--------------------------------------------------------------------
64 272 3 3 1
32 59 122939 653 188
16 36 108101 1190 90
8 21 98775 2104 46
4 11 95627 4556 20
I think this behavior mostly matches my expectations, although it's
interesting the stalls jump so much between 64 and 32 batches.
I did test both with buffered I/O (io_method=sync) and direct I/O
(io_method=worker), and the results are exactly the same for me. Not the
timings, of course, but the prefetch stats.
Of course, maybe there's something wrong in how the stats are collected.
I wonder if maybe we should update the distance in get_block() and not
in next_buffer().
Or maybe there's some interference from having to read the leaf pages
sooner. But I don't see why that would affect the queue depth, fewer
reset should keep the queues fuller I think.
I'll think about adding some sort of distance histogram to the stats.
Maybe something like tinyhist [2] would work here.
[1]
https://www.postgresql.org/message-id/h2n7d7zb2lbkdcemopvrgmteo35zzi5ljl2jmk32vz5f4pziql%407ppr6r6yfv4z
[2] https://github.com/tvondra/tinyhist
regards
--
Tomas Vondra
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4835c48b448..5e6c3208955 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -407,12 +407,6 @@ index_beginscan_internal(Relation indexRelation,
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
- /*
- * No batching by default, so set it to NULL. Will be initialized later if
- * batching is requested and AM supports it.
- */
- scan->xs_batches = NULL;
-
return scan;
}
@@ -463,6 +457,17 @@ index_rescan(IndexScanDesc scan,
orderbys, norderbys);
}
+void
+index_get_prefetch_stats(IndexScanDesc scan, int *accum, int *count, int *stalls, int *resets)
+{
+ /* ugly */
+ if (scan->xs_heapfetch->rs != NULL)
+ {
+ read_stream_prefetch_stats(scan->xs_heapfetch->rs,
+ accum, count, stalls, resets);
+ }
+}
+
/* ----------------
* index_endscan - end a scan
* ----------------
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e2792ead71..d92f68d0533 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -136,6 +136,7 @@ static void show_memoize_info(MemoizeState *mstate, List *ancestors,
ExplainState *es);
static void show_hashagg_info(AggState *aggstate, ExplainState *es);
static void show_indexsearches_info(PlanState *planstate, ExplainState *es);
+static void show_indexprefetch_info(PlanState *planstate, ExplainState *es);
static void show_tidbitmap_info(BitmapHeapScanState *planstate,
ExplainState *es);
static void show_instrumentation_count(const char *qlabel, int which,
@@ -1966,6 +1967,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+ show_indexprefetch_info(planstate, es);
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1983,6 +1985,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+ show_indexprefetch_info(planstate, es);
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -3889,6 +3892,45 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
}
+static void
+show_indexprefetch_info(PlanState *planstate, ExplainState *es)
+{
+ Plan *plan = planstate->plan;
+
+ int count = 0,
+ accum = 0,
+ stalls = 0,
+ resets = 0;
+
+ if (!es->analyze)
+ return;
+
+ /* Initialize counters with stats from the local process first */
+ switch (nodeTag(plan))
+ {
+ case T_IndexScan:
+ {
+ IndexScanState *indexstate = ((IndexScanState *) planstate);
+
+ count = indexstate->iss_PrefetchCount;
+ accum = indexstate->iss_PrefetchAccum;
+ stalls = indexstate->iss_PrefetchStalls;
+ resets = indexstate->iss_ResetCount;
+ break;
+ }
+ default:
+ break;
+ }
+
+ if (count > 0)
+ {
+ ExplainPropertyFloat("Prefetch Distance", NULL, (accum * 1.0 / count), 3, es);
+ ExplainPropertyUInteger("Prefetch Stalls", NULL, stalls, es);
+ ExplainPropertyUInteger("Prefetch Resets", NULL, resets, es);
+ }
+}
+
+
/*
* Show exact/lossy pages for a BitmapHeapScan node
*/
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 7fcaa37fe62..5511732aad2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -125,6 +125,12 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ index_get_prefetch_stats(scandesc,
+ &node->iss_PrefetchAccum,
+ &node->iss_PrefetchCount,
+ &node->iss_PrefetchStalls,
+ &node->iss_ResetCount);
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
@@ -1088,6 +1094,11 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->iss_RuntimeContext = NULL;
}
+ indexstate->iss_PrefetchAccum = 0;
+ indexstate->iss_PrefetchCount = 0;
+ indexstate->iss_PrefetchStalls = 0;
+ indexstate->iss_ResetCount = 0;
+
/*
* all done.
*/
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 0e7f5557f5c..529cb5dcbee 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -106,6 +106,11 @@ struct ReadStream
bool advice_enabled;
bool temporary;
+ int distance_accum;
+ int distance_count;
+ int distance_stalls;
+ int reset_count;
+
/*
* One-block buffer to support 'ungetting' a block number, to resolve flow
* control problems when I/Os are split.
@@ -681,6 +686,11 @@ read_stream_begin_impl(int flags,
stream->seq_until_processed = InvalidBlockNumber;
stream->temporary = SmgrIsTemp(smgr);
+ stream->distance_accum = 0;
+ stream->distance_count = 0;
+ stream->distance_stalls = 0;
+ stream->reset_count = 0;
+
/*
* Skip the initial ramp-up phase if the caller says we're going to be
* reading the whole relation. This way we start out assuming we'll be
@@ -772,6 +782,16 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
Buffer buffer;
int16 oldest_buffer_index;
+ if (stream->distance > 0)
+ {
+ stream->distance_accum += stream->distance;
+ stream->distance_count += 1;
+ }
+ else
+ {
+ stream->distance_stalls += 1;
+ }
+
#ifndef READ_STREAM_DISABLE_FAST_PATH
/*
@@ -1046,6 +1066,8 @@ read_stream_reset(ReadStream *stream)
/* Start off assuming data is cached. */
stream->distance = 1;
+
+ stream->reset_count += 1;
}
/*
@@ -1057,3 +1079,12 @@ read_stream_end(ReadStream *stream)
read_stream_reset(stream);
pfree(stream);
}
+
+void
+read_stream_prefetch_stats(ReadStream *stream, int *accum, int *count, int *stalls, int *resets)
+{
+ *accum = stream->distance_accum;
+ *count = stream->distance_count;
+ *stalls = stream->distance_stalls;
+ *resets = stream->reset_count;
+}
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 3a3a44be3a5..51c85414b0a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -235,6 +235,7 @@ extern bytea *index_opclass_options(Relation indrel, AttrNumber attnum,
Datum attoptions, bool validate);
extern IndexScanBatch index_batch_alloc(int maxitems, bool want_itup);
+extern void index_get_prefetch_stats(IndexScanDesc scan, int *accum, int *count, int *stalls, int *resets);
/*
* index access method support routines (in genam.c)
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f81..ae18c6a5125 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1722,6 +1722,11 @@ typedef struct IndexScanState
IndexScanInstrumentation iss_Instrument;
SharedIndexScanInstrumentation *iss_SharedInfo;
+ int iss_PrefetchAccum;
+ int iss_PrefetchCount;
+ int iss_PrefetchStalls;
+ int iss_ResetCount;
+
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
bool iss_ReachedEnd;
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 9b0d65161d0..53c754a3b3d 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -102,4 +102,6 @@ extern ReadStream *read_stream_begin_smgr_relation(int flags,
extern void read_stream_reset(ReadStream *stream);
extern void read_stream_end(ReadStream *stream);
+extern void read_stream_prefetch_stats(ReadStream *stream, int *accum, int *count, int *stalls, int *resets);
+
#endif /* READ_STREAM_H */
INDEX_SCAN_MAX_BATCHES 64
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191305
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191305
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 271.915
Prefetch Stalls: 3
Prefetch Resets: 3
Buffers: shared read=191305
Planning:
Buffers: shared hit=46 read=22
Planning Time: 3.996 ms
Execution Time: 3593.458 ms
(14 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191305
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191305
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 271.915
Prefetch Stalls: 3
Prefetch Resets: 3
Buffers: shared read=191305
Planning:
Buffers: shared hit=61 read=25
Planning Time: 4.211 ms
Execution Time: 5163.463 ms
(14 rows)
INDEX_SCAN_MAX_BATCHES 32
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191290
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191290
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 58.670
Prefetch Stalls: 122939
Prefetch Resets: 653
Buffers: shared read=191290
Planning:
Buffers: shared hit=46 read=22
Planning Time: 3.797 ms
Execution Time: 3511.968 ms
(14 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191290
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191290
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 58.670
Prefetch Stalls: 122939
Prefetch Resets: 653
Buffers: shared read=191290
Planning:
Buffers: shared hit=61 read=25
Planning Time: 4.108 ms
Execution Time: 5150.282 ms
(14 rows)
INDEX_SCAN_MAX_BATCHES 16
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191271
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191271
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 35.562
Prefetch Stalls: 108101
Prefetch Resets: 1190
Buffers: shared read=191271
Planning:
Buffers: shared hit=46 read=22
Planning Time: 3.965 ms
Execution Time: 3568.817 ms
(14 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191271
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191271
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 35.562
Prefetch Stalls: 108101
Prefetch Resets: 1190
Buffers: shared read=191271
Planning:
Buffers: shared hit=61 read=25
Planning Time: 4.371 ms
Execution Time: 5108.845 ms
(14 rows)
INDEX_SCAN_MAX_BATCHES 8
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191270
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191270
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 21.344
Prefetch Stalls: 98775
Prefetch Resets: 2104
Buffers: shared read=191270
Planning:
Buffers: shared hit=46 read=22
Planning Time: 3.706 ms
Execution Time: 3646.034 ms
(14 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191270
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191270
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 21.344
Prefetch Stalls: 98775
Prefetch Resets: 2104
Buffers: shared read=191270
Planning:
Buffers: shared hit=61 read=25
Planning Time: 4.280 ms
Execution Time: 5198.426 ms
(14 rows)
INDEX_SCAN_MAX_BATCHES 4
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191266
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191266
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 10.688
Prefetch Stalls: 95627
Prefetch Resets: 4556
Buffers: shared read=191266
Planning:
Buffers: shared hit=46 read=22
Planning Time: 3.799 ms
Execution Time: 3784.910 ms
(14 rows)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Aggregate (actual rows=1.00 loops=1)
Buffers: shared read=191266
-> Limit (actual rows=10000000.00 loops=1)
Buffers: shared read=191266
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
Index Searches: 1
Prefetch Distance: 10.688
Prefetch Stalls: 95627
Prefetch Resets: 4556
Buffers: shared read=191266
Planning:
Buffers: shared hit=61 read=25
Planning Time: 4.330 ms
Execution Time: 5195.308 ms
(14 rows)