Hi,

On 2026-02-15 00:13:39 +0000, Alexandre Felipe wrote:
> For the tests, I disable sorting, sequential scans, index only scans and
> bitmap scans.
> Since buffer cache always has a significant impact on the query
> performance, I shuffled the tests, and tried to adjust for the number of
> buffer hit/read, but later I found that the best way to control that was to
> use a table small enough to be entirely held in cache, and evict the
> buffers.

I doubt that shuffling the tests will provide meaningful uncached performance
in a test like this - your data size is tiny, so there's simply not a whole
lot of data, i.e. it all stays cached.  Which is also what is going to limit
the amount of benefit from prefetching you're going to get.

Evicting from the postgres buffercache won't do much, because you're not going
to evict it from the kernel cache, so there's no actual IO that prefetching
can avoid.


I think generally your queries are not particularly IO bound, due to the small
size.  I've adapted your data creation script to create a table with the
number of rows suffixed and created it for 100k and 1M rows.


100k rows:

prefetching=0:

 Index Scan using idx_periodic_100000 on prefetch_test_data_100000  
(cost=0.29..15351.09 rows=100000 width=8) (actual time=0.026..201.730 
rows=100000.00 loops>
   Index Searches: 1
   Buffers: shared hit=96875 read=3400
   I/O Timings: shared read=32.727
 Planning:
   Buffers: shared hit=81 read=23
   I/O Timings: shared read=1.521
 Planning Time: 1.851 ms
 Execution Time: 207.031 ms

prefetching=1:

 Index Scan using idx_periodic_100000 on prefetch_test_data_100000  
(cost=0.29..15351.09 rows=100000 width=8) (actual time=0.209..193.508 
rows=100000.00 loops=1)
   Index Searches: 1
   Buffers: shared hit=96875 read=3400
   I/O Timings: shared read=29.830
 Planning:
   Buffers: shared hit=81 read=23
   I/O Timings: shared read=1.962
 Planning Time: 2.299 ms
 Execution Time: 198.809 ms
(9 rows)

Here about 15.8% of the time is spent waiting for IO, improved by prefetching
only to 14.4%.


1M rows:

prefetching=0

 Index Scan using idx_periodic_1000000 on prefetch_test_data_1000000  
(cost=0.42..153478.15 rows=1000000 width=8) (actual time=0.126..5172.605 
rows=1000000.00 loops=1)
   Index Searches: 1
   Buffers: shared hit=971485 read=31250
   I/O Timings: shared read=3371.993
 Planning:
   Buffers: shared hit=96 read=7
   I/O Timings: shared read=0.591
 Planning Time: 0.946 ms
 Execution Time: 5225.337 ms


prefetching=1:

 Index Scan using idx_periodic_1000000 on prefetch_test_data_1000000  
(cost=0.42..153478.15 rows=1000000 width=8) (actual time=0.126..2106.244 
rows=1000000.00 loops=1)
   Index Searches: 1
   Buffers: shared hit=971485 read=31250
   I/O Timings: shared read=94.310
 Planning:
   Buffers: shared hit=96 read=7
   I/O Timings: shared read=0.585
 Planning Time: 0.938 ms
 Execution Time: 2157.727 ms

Here about 64.5% of the time is spent waiting for IO, improved by prefetching
to 4.3%.


I think one reason the smaller data performs so badly is that read stream
ramps down the distance too quickly, due to the high cache hit ratio, at least
in the periodic case.  If I disable the stream->distance-- in read_stream.c,
the results, particularly with debug_io_direct=data, are vastly better ([1]).  
It's
easier to see with DIO:


debug_io_direct=data, enable_indexscan_prefetch=0
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                            
QUERY PLAN                                                                      
      │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx_periodic_100000 on prefetch_test_data_100000  
(cost=0.29..15351.09 rows=100000 width=8) (actual time=0.311..643.697 
rows=100000.00 loops=1) │
│   Index Searches: 1                                                           
                                                                                
   │
│   Buffers: shared hit=97150 read=3125                                         
                                                                                
   │
│   I/O Timings: shared read=479.851                                            
                                                                                
   │
│ Planning Time: 0.066 ms                                                       
                                                                                
   │
│ Execution Time: 648.820 ms                                                    
                                                                                
   │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

debug_io_direct=data, enable_indexscan_prefetch=1, w/ distance--
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                            
QUERY PLAN                                                                      
      │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx_periodic_100000 on prefetch_test_data_100000  
(cost=0.29..15351.09 rows=100000 width=8) (actual time=0.570..676.908 
rows=100000.00 loops=1) │
│   Index Searches: 1                                                           
                                                                                
   │
│   Prefetch: distance=2.000 count=6008 stalls=93627 skipped=0 resets=0 
pauses=0 ungets=0 forwarded=0                                                   
           │
│             histogram [2,4) => 6008                                           
                                                                                
   │
│   Buffers: shared hit=97150 read=3125                                         
                                                                                
   │
│   I/O Timings: shared read=507.223                                            
                                                                                
   │
│ Planning Time: 0.056 ms                                                       
                                                                                
   │
│ Execution Time: 682.205 ms                                                    
                                                                                
   │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

debug_io_direct=data, enable_indexscan_prefetch=1, w/o distance--
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                            
QUERY PLAN                                                                      
      │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx_periodic_100000 on prefetch_test_data_100000  
(cost=0.29..15351.09 rows=100000 width=8) (actual time=0.304..187.170 
rows=100000.00 loops=1) │
│   Index Searches: 1                                                           
                                                                                
   │
│   Prefetch: distance=750.741 count=99721 stalls=4 skipped=0 resets=0 
pauses=90 ungets=0 forwarded=0                                                  
            │
│             histogram [2,4) => 3, [4,8) => 4, [8,16) => 6, [16,32) => 10, 
[32,64) => 25, [64,128) => 33, [128,256) => 65, [256,512) => 130, [512,1024) => 
99445  │
│   Buffers: shared hit=97150 read=3125                                         
                                                                                
   │
│   I/O Timings: shared read=25.086                                             
                                                                                
   │
│ Planning Time: 0.062 ms                                                       
                                                                                
   │
│ Execution Time: 192.363 ms                                                    
                                                                                
   │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

If the distance ends up capped at 2, it's not too surprising that readahead
doesn't work out.


FWIW, all that's seemingly needed to actually have mostly sufficient distance
is to change the distance increase from
  distance = stream->distance * 2;
to
  distance = stream->distance * 2 + 1;


The + 1 has a big effect because when the distance is 1, each hit reduces the
distance by as much as each miss does. Just addressing that improves the
results dramatically:

debug_io_direct=data, enable_indexscan_prefetch=1, w/ stream->distance * 2 + 1
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                             
QUERY PLAN                                                                      
        │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx_periodic_100000 on prefetch_test_data_100000  
(cost=0.29..15351.09 rows=100000 width=8) (actual time=0.316..176.703 
rows=100000.00 loops=1)    │
│   Index Searches: 1                                                           
                                                                                
      │
│   Prefetch: distance=707.476 count=11158 stalls=88503 skipped=0 resets=0 
pauses=26 ungets=0 forwarded=0                                                  
           │
│             histogram [2,4) => 5, [4,8) => 11, [8,16) => 26, [16,32) => 30, 
[32,64) => 54, [64,128) => 109, [128,256) => 221, [256,512) => 428, [512,1024) 
=> 10274 │
│   Buffers: shared hit=96875 read=3400                                         
                                                                                
      │
│   I/O Timings: shared read=33.874                                             
                                                                                
      │
│ Planning:                                                                     
                                                                                
      │
│   Buffers: shared hit=78 read=21                                              
                                                                                
      │
│   I/O Timings: shared read=2.772                                              
                                                                                
      │
│ Planning Time: 3.065 ms                                                       
                                                                                
      │
│ Execution Time: 182.959 ms                                                    
                                                                                
      │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The stall stats are bogus, because they get increased even when we correctly
are not prefetching due to everything being in shared buffers. I think the
  if (distance == 1) stats.nstalls++
would need to be just before the WaitReadBuffers().


Interestingly the stream->distance * 2 + 1 performs considerably better than
never decreasing the distance for the periodic case, which I guess makes
sense, because once the data is in s_b, it is purely cost to have a high
distance (for now, eventually I hope we can do batched buffer lookups to have
a small gain even in that case).


Greetings,

Andres Freund

[1] That's of course not an actual proposal, just a test


Reply via email to