On 2/15/26 21:59, Andres Freund wrote:
> Hi,
> 
> On 2026-02-15 14:34:07 -0500, Andres Freund wrote:
>> debug_io_direct=data, enable_indexscan_prefetch=1, w/ stream->distance * 2 + 1
>>
>>  Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15351.09 rows=100000 width=8) (actual time=0.316..176.703 rows=100000.00 loops=1)
>>    Index Searches: 1
>>    Prefetch: distance=707.476 count=11158 stalls=88503 skipped=0 resets=0 pauses=26 ungets=0 forwarded=0
>>              histogram [2,4) => 5, [4,8) => 11, [8,16) => 26, [16,32) => 30, [32,64) => 54, [64,128) => 109, [128,256) => 221, [256,512) => 428, [512,1024) => 10274
>>    Buffers: shared hit=96875 read=3400
>>    I/O Timings: shared read=33.874
>>  Planning:
>>    Buffers: shared hit=78 read=21
>>    I/O Timings: shared read=2.772
>>  Planning Time: 3.065 ms
>>  Execution Time: 182.959 ms
>>
>> The stall stats are bogus, because they get incremented even when we are
>> correctly not prefetching due to everything being in shared buffers. I think the
>>   if (distance == 1) stats.nstalls++
>> would need to be just before the WaitReadBuffers().
> 
> The histogram and distance are also somewhat misleading: They measure what the
> distance is at the time the next block is determined, but that's not really
> informative, as the distance can be much bigger than what we are actually
> doing IO wise (to allow for IO combining etc).  The limit for the number of
> in-flight IOs will be the limiting factor in a case with random-ish IOs and
> it's also really what matters for performance.
> 

This EXPLAIN part was hacked together as something to help us during
development, and a lot of the information is wonky and not well defined.
That's why we chose not to include it in the patches posted to hackers,
so I'm a bit confused about which patch / branch you're looking at.

For stalls you're probably right. I'll think about it.
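
Just to check I'm reading the suggestion right, here's a tiny standalone
sketch of the two placements (hypothetical names, not the actual
read_stream.c code). With a fully cached relation the distance==1
placement counts a "stall" for every block, while counting just before
the wait only sees the blocks that actually need I/O:

#include <stdbool.h>
#include <stdio.h>

typedef struct StreamStats
{
    long    stalls_at_distance1;    /* counted whenever distance == 1 */
    long    stalls_before_wait;     /* counted just before waiting for I/O */
} StreamStats;

/* one synthetic "give me the next buffer" step of a read stream */
static void
next_buffer(StreamStats *stats, int distance, bool need_io)
{
    if (distance == 1)
        stats->stalls_at_distance1++;       /* current placement: overcounts */

    if (need_io)
    {
        if (distance == 1)
            stats->stalls_before_wait++;    /* suggested placement */
        /* WaitReadBuffers() would block here */
    }
}

int
main(void)
{
    StreamStats stats = {0};

    /* 100000 blocks already in shared buffers, distance stuck at 1 */
    for (int i = 0; i < 100000; i++)
        next_buffer(&stats, 1, false);

    /* a handful of blocks that genuinely need I/O */
    for (int i = 0; i < 6; i++)
        next_buffer(&stats, 1, true);

    printf("distance==1: %ld, before wait: %ld\n",
           stats.stalls_at_distance1, stats.stalls_before_wait);
    return 0;
}

If that's what you meant, moving the increment should be straightforward.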

I'm not sure about the distance shown. What do you mean by "the distance
can be much bigger than what we are actually doing IO wise"? IIRC in
that particular case we needed to know how far ahead the "prefetch
position" is (i.e. how many index entries are in between). We might have
*skipped* many of those for various reasons - duplicate blocks,
all-visible pages in index-only scans, pages already in shared buffers,
and so on. And I think that's still interesting information, because it
tells us how many "batches ahead" we need.

Maybe we need to track multiple "distances"? I'm a bit worried about
overwhelming people, though.
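
If we did end up tracking both, it wouldn't have to be much code. Here is
a hypothetical standalone sketch (not the patch code): one power-of-two
histogram per metric, sampled at the same point and printed in the same
"[lo,hi) => count" format as above:

#include <stdio.h>

#define NBUCKETS 24

typedef struct PrefetchHistogram
{
    long    buckets[NBUCKETS];  /* bucket i counts samples in [2^i, 2^(i+1)) */
} PrefetchHistogram;

static void
hist_add(PrefetchHistogram *h, unsigned long value)
{
    int     bucket = 0;

    while (value > 1 && bucket < NBUCKETS - 1)
    {
        value >>= 1;
        bucket++;
    }
    h->buckets[bucket]++;
}

static void
hist_print(const char *name, const PrefetchHistogram *h)
{
    printf("%s histogram", name);
    for (int i = 0; i < NBUCKETS; i++)
    {
        if (h->buckets[i] > 0)
            printf(" [%lu,%lu) => %ld",
                   1UL << i, 1UL << (i + 1), h->buckets[i]);
    }
    printf("\n");
}

int
main(void)
{
    PrefetchHistogram lookahead = {0};
    PrefetchHistogram in_flight = {0};

    /* in the stream this would be sampled once per block returned */
    for (int i = 0; i < 1000; i++)
    {
        hist_add(&lookahead, 700);  /* e.g. stream->distance */
        hist_add(&in_flight, 7);    /* e.g. I/Os currently in flight */
    }

    hist_print("lookahead", &lookahead);
    hist_print("in-flight", &in_flight);
    return 0;
}

That would keep the "how far ahead are we looking" number and show the
"how much I/O is actually in flight" one right next to it.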

> For a query with lots of random prefetching, run with
> effective_io_concurrency=8, compare:
> 
> stats using stream->distance:
> 
>  Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.138..5425.851 rows=1000000.00 loops=1)
>    Buffers: shared hit=701118 read=301757
>    I/O Timings: shared read=3919.597
>    ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.137..5331.243 rows=1000000.00 loops=1)
>          Index Searches: 1
>          Prefetch: distance=143.999 count=1297237 stalls=6 skipped=3 resets=0 pauses=0 ungets=297462 forwarded=2
>                    histogram [16,32) => 9, [32,64) => 2, [64,128) => 2, [128,256) => 1297224
>          Buffers: shared hit=701118 read=301757
>          I/O Timings: shared read=3919.597
>  Planning:
>    Buffers: shared hit=96 read=7
>    I/O Timings: shared read=0.812
>  Planning Time: 1.310 ms
>  Execution Time: 5496.622 ms
> 
> 
> stats using stream->ios_in_progress:
> 
>  Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.157..5427.138 rows=1000000.00 loops=1)
>    Buffers: shared hit=701118 read=301757
>    I/O Timings: shared read=3919.857
>    ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.156..5332.607 rows=1000000.00 loops=1)
>          Index Searches: 1
>          Prefetch: distance=6.864 count=1297237 stalls=6 skipped=3 resets=0 pauses=0 ungets=297462 forwarded=106
>                    histogram [1,2) => 1164, [2,4) => 12634, [4,8) => 1283335
>          Buffers: shared hit=701118 read=301757
>          I/O Timings: shared read=3919.857
>  Planning:
>    Buffers: shared hit=96 read=7
>    I/O Timings: shared read=0.809
>  Planning Time: 1.305 ms
>  Execution Time: 5502.332 ms
> 
> 
> The former sure makes it look like the bottleneck isn't the number of IOs
> in flight - after all, the distance is quite deep.  Yet, if you increase
> effective_io_concurrency:
> 
> stats using stream->distance:
> 
>  Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.138..2769.847 rows=1000000.00 loops=1)
>    Buffers: shared hit=701960 read=301786
>    I/O Timings: shared read=1100.819
>    ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.137..2667.780 rows=1000000.00 loops=1)
>          Index Searches: 1
>          Prefetch: distance=1038.301 count=1293137 stalls=0 skipped=3 resets=0 pauses=694 ungets=291801 forwarded=2
>                    histogram [16,32) => 16, [32,64) => 18, [64,128) => 33, [128,256) => 2, [256,512) => 3, [512,1024) => 125045, [1024,2048) => 1168020
>          Buffers: shared hit=701960 read=301786
>          I/O Timings: shared read=1100.819
>  Planning:
>    Buffers: shared hit=96 read=7
>    I/O Timings: shared read=0.819
>  Planning Time: 1.346 ms
>  Execution Time: 2845.891 ms
> (14 rows)
> 
> 
> stats using stream->ios_in_progress:
> 
>  Limit  (cost=0.43..150968.68 rows=1000000 width=8) (actual time=0.136..2767.839 rows=1000000.00 loops=1)
>    Buffers: shared hit=701960 read=301786
>    I/O Timings: shared read=1102.115
>    ->  Index Scan using idx_random_10000000 on prefetch_test_data_10000000  (cost=0.43..1509682.87 rows=10000000 width=8) (actual time=0.135..2665.838 rows=1000000.00 loops=1)
>          Index Searches: 1
>          Prefetch: distance=61.054 count=1293134 stalls=0 skipped=3 resets=0 pauses=694 ungets=291798 forwarded=4
>                    histogram [1,2) => 1, [2,4) => 2, [4,8) => 4, [8,16) => 9, [16,32) => 8737, [32,64) => 1284379
>          Buffers: shared hit=701960 read=301786
>          I/O Timings: shared read=1102.115
>  Planning:
>    Buffers: shared hit=96 read=7
>    I/O Timings: shared read=0.819
>  Planning Time: 1.334 ms
>  Execution Time: 2846.017 ms
> 
> 
> If you want to measure in the domain of blocks rather than IOs, I'd suggest at
> least using the number of currently pinned buffers, rather than the
> distance. But I think the number of IOs is a much more useful measure.
> 
> E.g. for prefetching an index scan on a sequential column, with:
> 
> effective_io_concurrency=128 io_combine_limit=16
> 
> stats using stream->distance:
>    Prefetch: distance=18.872 count=320174 stalls=30736 skipped=9687128 resets=0 pauses=7684 ungets=0 forwarded=0
>              histogram [1,2) => 3416, [2,4) => 23052, [4,8) => 53788, [8,16) => 115244, [16,32) => 124654, [32,64) => 20
>    Buffers: shared hit=27325 read=312500
>    I/O Timings: shared read=1794.060
> 
> 
> stats using stream->pinned_buffers:
>    Prefetch: distance=8.259 count=320174 stalls=30736 skipped=9687128 resets=0 pauses=7684 ungets=0 forwarded=29897
>              histogram [1,2) => 23052, [2,4) => 58070, [4,8) => 84497, [8,16) => 105873, [16,32) => 18785
>    Buffers: shared hit=27325 read=312500
>    I/O Timings: shared read=887.850
> 
> 
> stats using stream->ios_in_progress:
>    Prefetch: distance=0.059 count=320174 stalls=30736 skipped=9687128 resets=0 pauses=7684 ungets=0 forwarded=301389
>              histogram [1,2) => 18785
>    Buffers: shared hit=27325 read=312500
>    I/O Timings: shared read=881.723
> 
> 
> Only the ios_in_progress one really shows that we are basically not doing
> any readahead, because heapam pauses, thinking we're far enough ahead - but
> due to IO combining the batchdistance limit is reached before we can issue
> multiple IOs.
> 
> 
> effective_io_concurrency=128 io_combine_limit=4:
> 
> stats using stream->ios_in_progress:
>    Prefetch: distance=4.200 count=328882 stalls=1 skipped=9687128 resets=0 pauses=16391 ungets=0 forwarded=9
>              histogram [1,2) => 8, [2,4) => 129775, [4,8) => 199090
>    Buffers: shared hit=27325 read=312500
>    I/O Timings: shared read=300.284
> 
> 
> FWIW, if I change the batchdistance <= 2 check to <= 8, I get good perf even
> with io_combine_limit=16:
> 
> stats using stream->ios_in_progress:
>    Prefetch: distance=2.605 count=315526 stalls=3 skipped=9687128 resets=0 pauses=3035 ungets=0 forwarded=50
>              histogram [1,2) => 72679, [2,4) => 170115, [4,8) => 72682
>    Buffers: shared hit=27325 read=312500
>    I/O Timings: shared read=125.902
> 
> but that was just an experiment.
> 

I'll take a close look tomorrow, but AFAICS we really aim to measure two
rather different things. I've been interested in "how far ahead are we
looking", while you're more interested in the number of I/Os initiated by
the stream. Both seem interesting and necessary to understand what's
going on.
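
Maybe the least confusing option is to simply report both, e.g. something
like this (just a mock-up of a possible output format, not implemented):

   Prefetch: lookahead=... in-flight=... count=... stalls=...

with one histogram per metric, so neither number has to stand in for the
other.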


regards

-- 
Tomas Vondra


