On 2/27/26 09:51, Alexandre Felipe wrote:
>
>
> On Fri, Feb 27, 2026 at 4:18 AM Andres Freund <[email protected]
> <mailto:[email protected]>> wrote:
>
> Hi,
>
> I'm planning to do some reviewing in the next days. In preparation I
> just
> retried a benchmark and saw some odd results. After a while I was
> able to
> reproduce even with a simpler setup:
>
>
> Since we are talking about results I will share mine too :)
>
>
> The bottom line is: prefetch is working, but it might make some things
> slower.
>
>
> It is obvious that this should better exploit IO for one single
> heavy query, in one single table.
>
> It is not so obvious, to me, how this would behave when there are
> multiple concurrent queries. It is not so obvious how this will
> impact when multiple tables are queried at the same time. My feeling
> is that it should greatly improve things on a disk with a mechanical
> head, if it performs the same reads while reducing the number of
> seeks from one location to another. Is there much interest in special
> optimisations for those or is the focus more on SSDs?
>
I don't follow your argument about spinning drives vs. SSDs.
Are you suggesting prefetching is not as important for SSDs? Prefetching
can help on systems with spinning drives, but I'd say
performance-sensitive systems are not using spinning drives anymore.
Yes, multiple backends executing queries and issuing I/O concurrently
may have a similar effect - the queues will be longer, and the storage
system will be better utilized. But what if there are not that many
active backends, or most backends are querying cached data?
I don't understand your point about querying multiple tables. Please
clarify why you believe it makes prefetching less effective.
>
> On my previous review I wasted way too much time trying to improve
> read_stream, only to end up with mixed results. This time I
> tried to step back and look at various functions that could
> have changed. Initially I tried compiler function instrumentation,
> but then the profiling overhead of 33k functions dominated.
>
>
> This time what I did: (1) added an indexscan_prefetch_distance (maybe
> a better name would be just prefetch_distance), which limits the
> growth of distance in read_stream (distance-limit.diff); (2) captured
> execution statistics for 15 functions
> (profiling-instrumentation.diff). At exit each process creates a log
> with its configuration and call statistics.
>
Not sure what you're trying to achieve by this instrumentation, or why
you didn't just do profiling using perf. Adding custom code to hot paths
is going to affect L1 cache hit ratios, for example.
Likewise, I'm not sure what the point of indexscan_prefetch_distance is.
What are you trying to achieve with it? It likely maximizes the overhead
by forcing tiny I/O requests; avoiding that is the whole reason we allow
the distance to grow.
>
> The benchmark was a full index scan on a sequential column,
> executed repeatedly with no cache eviction: the buffer hit path.
>
>
> BENCHMARK RESULTS
>
>
> MacOS in normal (for me) use
>
> Prefetch   Avg Time   Min Time   Max Time
> ------------------------------------------------
> off        6.03s      5.12s      11.70s
> 1          59.44s     25.33s     257.60s
> 4          19.74s     12.66s     44.36s
> 16         11.87s     7.49s      19.13s
> 64         8.77s      6.05s      13.97s
> 128        6.40s      4.33s      11.74s
>
> MacOS idle, after reboot
>
> Prefetch   Avg Time   Min Time   Max Time
> ------------------------------------------------
> off        2.17s      2.12s      2.26s
> 1          5.53s      5.44s      5.57s
> 4          3.17s      3.04s      3.39s
> 16         3.13s      3.04s      3.29s
> 64         2.82s      2.66s      2.88s
> 128        2.83s      2.69s      2.90s
>
> Docker on MacOS, idle, after reboot
>
> Prefetch   Avg Time   Min Time   Max Time
> ------------------------------------------------
> off        1.38s      1.36s      1.46s
> 1          3.65s      3.56s      3.70s
> 4          2.00s      1.98s      2.09s
> 16         1.56s      1.53s      1.59s
> 64         1.29s      1.25s      1.33s
> 128        1.28s      1.26s      1.32s
>
I don't know what "in normal use" means. I don't have access to a MacOS
machine to figure out how the filesystems behave under load, etc.
It'd be helpful if you could investigate that on your system. We don't
even know what exact hardware it has.
BTW is this with the custom instrumentation or not? How expensive is
clock_gettime on MacOS?
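One way to get a rough answer to the clock_gettime question is to
measure the per-call timer cost directly. This goes through Python's
perf_counter_ns (which is clock_gettime-based on most platforms), so it
overstates the raw vDSO/syscall cost, but it shows the order of
magnitude:

```python
import time

def timer_overhead_ns(n=1_000_000):
    """Estimate the average per-call cost of reading the monotonic
    clock, which bounds the overhead of timestamp instrumentation.
    Includes Python call overhead, so it's an upper bound."""
    start = time.perf_counter_ns()
    for _ in range(n):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / n

if __name__ == "__main__":
    print(f"~{timer_overhead_ns():.0f} ns per timer call")
```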
>
> Docker on Linux
>
> Prefetch   Avg Time   Min Time   Max Time
> ------------------------------------------------
> off        6.07s      5.92s      6.29s
> 1          6.85s      6.67s      7.04s
> 4          6.26s      6.10s      6.41s
> 16         6.14s      5.95s      6.30s
> 64         5.74s      5.62s      5.91s
> 128        5.72s      5.63s      5.86s
>
>
> The Linux execution showed very little degradation. On the MacOS host
> the degradation was more noticeable than in Docker on MacOS running
> Debian, suggesting that the software ecosystem contributes. Docker on
> MacOS (arm) was slower than Docker on native Linux (x86_64); here it
> could be CPU architecture or OS kernel differences.
>
No idea. You're the only one who has access to the system and can
investigate what's happening.
Are all the tests running on the same hardware? Or is MacOS on one
system and Linux on a different system?
>
>
> WHAT CHANGED
>
> The benchmark produced 195 autovac_worker logs, 3293 backend logs,
> and one bgworker log. For prefetch off the number of calls is
> constant; for prefetch on they vary widely. I am looking at the
> total time per function, assuming that the differences in the number
> of calls only change how the work was partitioned, not the total
> amount of work.
>
I don't understand. The number of which calls varies widely?
I'm not sure elapsed wall-clock time per function is particularly
interesting.
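If the intent is to compare total time per function regardless of how
the calls were distributed across processes, I'd first sum the
per-process logs into one profile per configuration. A sketch, assuming
a hypothetical (function, calls, total_ms) log format, since I don't
know what the instrumentation patch actually emits:

```python
from collections import defaultdict

def aggregate_logs(logs):
    """Sum call counts and total time per function across per-process
    logs. Each log is a list of (function, calls, total_ms) tuples;
    this record format is a guess, not the patch's actual output."""
    calls = defaultdict(int)
    total_ms = defaultdict(float)
    for log in logs:
        for func, n, ms in log:
            calls[func] += n
            total_ms[func] += ms
    return {f: (calls[f], total_ms[f]) for f in calls}
```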
>
> With Docker version 28.3.0, build 38b7060, Python 3.10.18
>
>
> $ docker compose up --build benchmark
>
> $ docker cp docker-postgres-1:/tmp/profiling ./docker-profiling
>
> $ python compare_profiles.py docker-profiling
>
>
> Function                  off,d=0    on,d=128       Diff        %  z-statistic
> ------------------------------------------------------------------------------
> read_stream_next_buffer       0.0      3944.9    +3944.9      NEW      +654.88
> read_stream_look_ahead        0.0      2999.3    +2999.3      NEW      +624.00
> WaitReadBuffers              98.3       754.6     +656.3  +667.7%      +414.56
> _bt_next                    748.7      1072.8     +324.1   +43.3%       +20.35
> heapam_batch_getnext        788.4      1114.6     +326.2   +41.4%       +20.18
> btgetbatch                  777.0      1096.7     +319.7   +41.2%       +20.14
> IndexNext                 17031.7     10400.3    -6631.5   -38.9%      -249.51
> _bt_first                    17.2        13.0       -4.2   -24.6%        10.56
>
> Function                  off,d=0      on,d=1       Diff        %  z-statistic
> ------------------------------------------------------------------------------
> read_stream_look_ahead        0.0     28135.9   +28135.9      NEW          N/A
> read_stream_next_buffer       0.0    199245.7  +199245.7      NEW          N/A
> IndexNext                 17031.7    211861.0  +194829.2    12.4x      +283.00
> WaitReadBuffers              98.3    169641.9  +169543.6    1724x      +275.63
> heapam_index_fetch_tuple  13564.5    205828.2  +192263.7      15x      +172.56
> heapam_batch_getnext        788.4      1944.0    +1155.6  +146.6%       +25.39
> _bt_next                    748.7      1833.9    +1085.2  +144.9%       +24.85
> btgetbatch                  777.0      1881.2    +1104.2  +142.1%       +24.74
> _bt_first                    17.2        19.3       +2.1   +12.0%       +10.33
>
> P.S.: The cache eviction in the Docker environment requires
> adjustments.
>
No idea. There's almost no information about the configuration, so I
can't even try reproducing this on Linux.
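Also, it's not clear how the z-statistic is computed. If it's the usual
two-sample z over per-process totals, it'd be something like this
sketch (an assumption on my part, not necessarily what
compare_profiles.py does):

```python
import math
import statistics

def two_sample_z(a, b):
    """Two-sample z-statistic for the difference of means of two sets
    of per-process totals. Assumes reasonably large samples; positive
    means b (prefetch on) is slower than a (prefetch off)."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se
```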
regards
--
Tomas Vondra