On 18/12/2025 4:45 PM, Andres Freund wrote:
> Hi,
>
> On 2025-12-18 15:40:59 +0100, Tomas Vondra wrote:
>> The technical reason is that batch_getnext() does this:
>>
>>    /* Delay initializing stream until reading from scan's second batch */
>>    if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
>>        enable_indexscan_prefetch)
>>        scan->xs_heapfetch->rs =
>>            read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
>>                                       ....);
>>
>> which means we create the read_stream (which is what enables the
>> prefetching) only when creating the second batch. And with LIMIT 100 we
>> likely read just a single leaf page (=batch) most of the time, which
>> means no read_stream and thus no prefetching.
>
> Why is the logic tied to the number of batches, rather than the number
> of items in batches? It's not hard to come up with scenarios where
> having to wait for ~100 random pages will be the majority of the
> queries' IO wait... It makes sense to not initialize readahead if we
> just fetch an entry or two, but after that?


I did more experiments trying to understand when we can take advantage of prefetching.

The schema is the same:

create table t (pk integer, sk integer, payload text default repeat('x', 1000)) with (fillfactor=10);
insert into t values (generate_series(1,10000000),random()*10000000);
create index on t(sk);

select.sql:

\set sk random(1, 10000000)
select * from t where sk >= :sk order by sk limit N;

And I run:

pgbench -n -T 30 -M prepared -f select.sql postgres

limit\prefetch      on       off    always    incremental    (tps)
1                12074     12765      3146           3282
2                 5912      6198      2463           2438
4                 2919      3047      1334           1964
8                 1554      1496      1166           1409
16                 815       775       947            940
32                 424       403       687            695
64                 223       208       446            453
128                115       106       258            270
256                 68        53       138            149
512                 43        27        72             78
1024                28        13        38             40


prefetch=always means commenting out the `priorbatch` check, so the read_stream is created immediately:

        /* Delay initializing stream until reading from scan's second batch */
-       if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
+       if (/*priorbatch && */!scan->xs_heapfetch->rs && !batchqueue->disabled &&

prefetch=incremental replaces the doubling of the prefetch distance with an increment:

        /* Look-ahead distance ramps up rapidly after we do I/O. */
-        distance = stream->distance * 2;
+       distance = stream->distance ? stream->distance + 1 : 0;


So, as you expected, immediate creation of the read_stream causes quite a significant performance degradation for index scans inspecting a small number of TIDs. The threshold where the read stream starts to provide a performance advantage is about 10 TIDs; past it, earlier initialization of the read stream gives a quite noticeable performance improvement.

I tried to find out, using a profiler and debugger, where most of the time is spent in this case, and the answer was quite predictable:
in read_stream_reset->read_stream_next_buffer.

So we are just consuming prefetched buffers which we do not need.

I thought we could use some better policy for increasing the prefetch distance (right now it is just doubled), which is why I tried this "incremental" policy.
Unfortunately it cannot reduce the prefetch overhead for "short" index scans. But what surprised me is that for longer index scans this approach seems to be slightly more efficient than doubling.


So it looks like we really should use a number-of-items criterion for read stream initialization rather than the number of batches.
And maybe think about an alternative policy for increasing the prefetch distance.




