Hi,

On 2026-02-15 22:17:05 +0100, Tomas Vondra wrote:
> I don't have access to an M1 machine (and it also does not say what type
> of storage it is using, which seems pretty important for a patch aiming
> to improve I/O behavior). But I tried running this on my ryzen machine
> with local SSDs (in RAID0), and with the 100k rows (and fixed handling
> of page cache) I get this:
> 
>   column_name io_method  evict  n  master_ms  off_ms  on_ms  effect_pct
>      periodic    worker    off 10       35.8    35.1   36.5         2.0
>      periodic    worker     os 10       49.4    49.9   58.8         8.1
>      periodic    worker     pg 10       39.5    39.9   47.1         8.3
>        random    worker    off 10       35.9    35.6   35.7         0.2
>        random    worker     os 10       49.0    49.0   42.6        -7.0
>        random    worker     pg 10       39.6    39.9   40.9         1.2
>    sequential    worker    off 10       28.2    27.9   27.7        -0.4
>    sequential    worker     os 10       39.3    39.2   34.8        -6.0
>    sequential    worker     pg 10       30.1    30.1   29.4        -1.3
> 
>   column_name io_method  evict  n  master_ms  off_ms  on_ms  effect_pct
>      periodic  io_uring    off 10       35.9    35.8   35.8        -0.1
>      periodic  io_uring     os 10       49.3    49.9   50.0         0.1
>      periodic  io_uring     pg 10       40.1    39.8   41.7         2.4
>        random  io_uring    off 10       35.6    35.2   35.7         0.8
>        random  io_uring     os 10       49.1    48.9   46.1        -3.0
>        random  io_uring     pg 10       39.8    40.1   42.6         3.1
>    sequential  io_uring    off 10       28.0    27.8   28.0         0.4
>    sequential  io_uring     os 10       39.8    39.1   40.7         1.9
>    sequential  io_uring     pg 10       30.2    30.0   29.6        -0.8
> 
> This is on default config with io_workers=12 and data_checksums=off. I'm
> not showing results for parallel query, because it's irrelevant.
> 
> This also has timings for master, for worker and io_uring (which you
> could not get on M1, at least not on macOS). For "worker" the differences
> are much smaller (within 10% in the worst case), and almost non-existent
> for io_uring. Which suggests this is likely due to the "signal" overhead
> associated with worker, which can be annoying for certain data patterns
> (where we end up issuing an I/O for individual blocks at distance 1).

I don't think this is just the signalling issue. For "periodic" I think it's
the signalling latency being triggered by the read stream distance being kept
too low. With such a small distance there is essentially no I/O in flight to
hide the latency behind, so it affects us much more.

On my system, with turbo boost etc. disabled:

worker w/ enable_indexscan_prefetch=0:

 Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..84.129 rows=100000.00 loops=1)
   Index Searches: 1
   Buffers: shared hit=97150 read=3125
   I/O Timings: shared read=31.274
 Planning:
   Buffers: shared hit=97 read=7
   I/O Timings: shared read=0.595
 Planning Time: 0.944 ms
 Execution Time: 89.319 ms


worker w/ enable_indexscan_prefetch=1:

 Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.158..115.279 rows=100000.00 loops=1)
   Index Searches: 1
   Prefetch: distance=1.060 count=99635 stalls=3004 skipped=0 resets=0 pauses=0 ungets=0 forwarded=0
             histogram [1,2) => 93627, [2,4) => 6008
   Buffers: shared hit=97150 read=3125
   I/O Timings: shared read=56.077
 Planning:
   Buffers: shared hit=97 read=7
   I/O Timings: shared read=0.612
 Planning Time: 0.994 ms
 Execution Time: 120.575 ms

Right, a regression. But note how low the distance is - no wonder the worker
latency has a bad effect - we only have the downside, never the upside, as
there's pretty much no IO concurrency.
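
Back-of-the-envelope from the plans above: 31.274 ms / 3125 reads is about 10
us per read without prefetching, vs. 56.077 ms / 3125, about 18 us, with it,
and 3004 of the 3125 reads stalled. So nearly every read pays the full worker
signalling round trip on the critical path, with nothing to overlap it with.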


After applying this diff:

@@ -1006,7 +1038,9 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
             stream->oldest_io_index = 0;
 
         /* Look-ahead distance ramps up rapidly after we do I/O. */
-        distance = stream->distance * 2;
+        distance = stream->distance * 2
+            + 1
+            ;
         distance = Min(distance, stream->max_pinned_buffers);
         stream->distance = distance;
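
In effect the ramp-up becomes stream->distance * 2 + 1: at distance 1, a miss
now gains 2 instead of 1, so it can outpace the per-hit decrement and actually
ramp up.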
 

worker w/ enable_indexscan_prefetch=1 + patch:

 Index Scan using idx_periodic_100000 on prefetch_test_data_100000  (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..82.673 rows=100000.00 loops=1)
   Index Searches: 1
   Prefetch: distance=70.892 count=103109 stalls=5 skipped=0 resets=0 pauses=0 ungets=3474 forwarded=0
             histogram [1,2) => 88975, [2,4) => 5, [4,8) => 11, [8,16) => 26, [16,32) => 28, [32,64) => 64, [64,128) => 104, [128,256) => 136, [256,512) => 602, [512,1024) => 13158
   Buffers: shared hit=97150 read=3125
   I/O Timings: shared read=19.711
 Planning:
   Buffers: shared hit=97 read=7
   I/O Timings: shared read=0.596
 Planning Time: 0.951 ms
 Execution Time: 87.887 ms

By no means a huge win compared to prefetching being disabled, but the
regression does vanish.


The problem that this fixes is that the periodic workload frequently has cache
hits, each of which reduces stream->distance by 1. Then, on a miss, we double
the distance. But at a distance of 1, doubling gains only 1, so with the
trivial pattern of alternating one hit and one miss, which this workload very
often has, you *never* ramp up - the distance just bounces between 1 and 2.
I.e. we increase the distance exactly as quickly as we decrease it.
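
To make that concrete, here is a minimal standalone simulation of the
heuristic as just described (a sketch, not the actual read_stream.c code; it
assumes the per-hit decrement bottoms out at a distance of 1, and uses a fixed
cap standing in for stream->max_pinned_buffers):

#include <stdio.h>

#define DISTANCE_CAP 512        /* stand-in for stream->max_pinned_buffers */

static int
step(int distance, int hit, int add_one)
{
    if (hit)
    {
        /* cache hit: decay by one, floored at 1 */
        if (distance > 1)
            distance--;
    }
    else
    {
        /* cache miss: ramp up, capped */
        distance = distance * 2 + add_one;
        if (distance > DISTANCE_CAP)
            distance = DISTANCE_CAP;
    }
    return distance;
}

int
main(void)
{
    for (int add_one = 0; add_one <= 1; add_one++)
    {
        int distance = 1;

        /* the trivial pattern: hit, miss, hit, miss, ... */
        for (int i = 0; i < 40; i++)
            distance = step(distance, i % 2 == 0, add_one);

        printf("ramp \"* 2%s\": distance after 40 steps = %d\n",
               add_one ? " + 1" : "", distance);
    }
    return 0;
}

With the plain doubling the distance just bounces between 1 and 2 forever;
with the extra +1 it escapes and reaches the cap within fewer than ten misses,
consistent with the patched run's histogram reaching the [512,1024) bucket.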


Greetings,

Andres Freund

