On 2/15/26 23:39, Andres Freund wrote:
> Hi,
> 
> On 2026-02-15 22:17:05 +0100, Tomas Vondra wrote:
>> I don't have access to an M1 machine (and it also does not say what type
>> of storage it is using, which seems pretty important for a patch aiming
>> to improve I/O behavior). But I tried running this on my Ryzen machine
>> with local SSDs (in RAID0), and with the 100k rows (and fixed handling
>> of page cache) I get this:
>>
>>   column_name io_method  evict  n  master_ms  off_ms  on_ms  effect_pct
>>      periodic    worker    off 10       35.8    35.1   36.5         2.0
>>      periodic    worker     os 10       49.4    49.9   58.8         8.1
>>      periodic    worker     pg 10       39.5    39.9   47.1         8.3
>>        random    worker    off 10       35.9    35.6   35.7         0.2
>>        random    worker     os 10       49.0    49.0   42.6        -7.0
>>        random    worker     pg 10       39.6    39.9   40.9         1.2
>>    sequential    worker    off 10       28.2    27.9   27.7        -0.4
>>    sequential    worker     os 10       39.3    39.2   34.8        -6.0
>>    sequential    worker     pg 10       30.1    30.1   29.4        -1.3
>>
>>   column_name io_method  evict  n  master_ms  off_ms  on_ms  effect_pct
>>      periodic  io_uring    off 10       35.9    35.8   35.8        -0.1
>>      periodic  io_uring     os 10       49.3    49.9   50.0         0.1
>>      periodic  io_uring     pg 10       40.1    39.8   41.7         2.4
>>        random  io_uring    off 10       35.6    35.2   35.7         0.8
>>        random  io_uring     os 10       49.1    48.9   46.1        -3.0
>>        random  io_uring     pg 10       39.8    40.1   42.6         3.1
>>    sequential  io_uring    off 10       28.0    27.8   28.0         0.4
>>    sequential  io_uring     os 10       39.8    39.1   40.7         1.9
>>    sequential  io_uring     pg 10       30.2    30.0   29.6        -0.8
>>
>> This is on default config with io_workers=12 and data_checksums=off. I'm
>> not showing results for parallel query, because it's irrelevant.
>>
>> This also has timings for master, for worker and io_uring (which you
>> could not get on the M1, at least not on macOS). For "worker" the differences
>> are much smaller (within 10% in the worst case), and almost non-existent
>> for io_uring. Which suggests this is likely due to the "signal" overhead
>> associated with worker, which can be annoying for certain data patterns
>> (where we end up issuing an I/O for individual blocks at distance 1).
> 
> I don't think this is just the signalling issue. For "periodic" I think it's
> the signalling latency amplified by the read stream distance being kept too
> low. Due to the small distance, the latency affects us much more.
> 
> On my system, with turbo boost etc. disabled:
> 
> worker w/ enable_indexscan_prefetch=0:
> 
>  Index Scan using idx_periodic_100000 on prefetch_test_data_100000 (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..84.129 rows=100000.00 loops=1)
>    Index Searches: 1
>    Buffers: shared hit=97150 read=3125
>    I/O Timings: shared read=31.274
>  Planning:
>    Buffers: shared hit=97 read=7
>    I/O Timings: shared read=0.595
>  Planning Time: 0.944 ms
>  Execution Time: 89.319 ms
> 
> 
> worker w/ enable_indexscan_prefetch=1:
> 
>  Index Scan using idx_periodic_100000 on prefetch_test_data_100000 (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.158..115.279 rows=100000.00 loops=1)
>    Index Searches: 1
>    Prefetch: distance=1.060 count=99635 stalls=3004 skipped=0 resets=0 pauses=0 ungets=0 forwarded=0
>              histogram [1,2) => 93627, [2,4) => 6008
>    Buffers: shared hit=97150 read=3125
>    I/O Timings: shared read=56.077
>  Planning:
>    Buffers: shared hit=97 read=7
>    I/O Timings: shared read=0.612
>  Planning Time: 0.994 ms
>  Execution Time: 120.575 ms
> 
> Right, a regression. But note how low the distance is - no wonder the worker
> latency has a bad effect - we only have the downside, never the upside, as
> there's pretty much no IO concurrency.
> 
> 
> After applying this diff:
> 
> @@ -1006,7 +1038,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
>              stream->oldest_io_index = 0;
>  
>          /* Look-ahead distance ramps up rapidly after we do I/O. */
> -        distance = stream->distance * 2;
> +        distance = stream->distance * 2 + 1;
>          distance = Min(distance, stream->max_pinned_buffers);
>          stream->distance = distance;
>  
> 
> worker w/ enable_indexscan_prefetch=1 + patch:
> 
>  Index Scan using idx_periodic_100000 on prefetch_test_data_100000 (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..82.673 rows=100000.00 loops=1)
>    Index Searches: 1
>    Prefetch: distance=70.892 count=103109 stalls=5 skipped=0 resets=0 pauses=0 ungets=3474 forwarded=0
>              histogram [1,2) => 88975, [2,4) => 5, [4,8) => 11, [8,16) => 26, [16,32) => 28, [32,64) => 64, [64,128) => 104, [128,256) => 136, [256,512) => 602, [512,1024) => 13158
>    Buffers: shared hit=97150 read=3125
>    I/O Timings: shared read=19.711
>  Planning:
>    Buffers: shared hit=97 read=7
>    I/O Timings: shared read=0.596
>  Planning Time: 0.951 ms
>  Execution Time: 87.887 ms
> 
> By no means a huge win compared to prefetching being disabled, but the
> regression does vanish.
> 
> 
> The problem that this fixes is that the periodic workload has cache hits
> frequently, which reduce the stream->distance by 1. Then, on a miss, we double
> the distance.  But that means that if you have the trivial pattern of one hit
> and one miss, which this workload very often has, you *never* get above
> 1. I.e. we increase the distance as quickly as we decrease it.
> 

I described this exact behavior a couple of months ago in this very
thread. The queue of batches has a limited size, so the prefetch needs
to be paused - at that point read_stream_pause did not exist, so this
was done by terminating the stream and then calling read_stream_reset
to "resume" it.

That had the unfortunate effect of resetting the distance to 1, which
easily led to exactly the issue you describe. Without the pausing it
would work perfectly fine in many cases.

The way I'm thinking about it is that each data set with a uniform cache
miss rate has a distance threshold for stability. Let's say the miss
rate is 1%, i.e. we do one I/O and then the next 99 blocks are cache
hits. On the miss we double the distance, and then each hit decays it
by 1. To be stable, these two effects need to balance. If we start at
distance=99, we'll go 198->99->198->99->198->... forever. But if we
start just a little bit lower, at 98, we'll quickly decay to distance=1.
This is what I call the "stability threshold".

With the resetting, this effect is pretty brutal. Queries often have a
period of increased cache misses at the beginning (because the caches
are cold), so the distance can climb above the threshold and then keep
up a sufficient prefetch distance. But the stream reset sets distance=1,
and there are no longer enough cache misses to push it back up.

Having read_stream_pause/resume solves this particular issue, but the
overall problem of the "stability threshold" still exists. Even if the
distance gets adjusted differently (e.g. 2*d + 1), that merely shifts
the threshold (to 98 for the 1% miss rate above, since 2*98 + 1 - 99 =
98), and it'll still be trivial to construct "nice" data sets that
narrowly miss it. I don't see a way around that, unfortunately (short
of not decaying the distance at all).

I think a "proper" solution would require some sort of cost model for
the I/O part, so that we can schedule the I/Os just so that the I/O
completes right before we actually need the page.

A trivial, super-simplified example: let's say reading a page costs 100
coins, and processing one tuple (fetched from the index scan + heap)
takes 1 coin. Then we need to issue I/Os ~100 tuples ahead, because an
I/O we'll need in ~100 cycles should be issued now so that it's just
about complete when we get there.

Of course, that's a super-simplified model. The point is that it needs
some concept of actual I/O costs - I don't think there's a perfect
formula that considers only the distance itself.
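
For illustration, a sketch of what such a cost model might look like
(all names and constants here are invented, nothing from the patch):

#include <stdio.h>

/* Hypothetical cost model: everything is measured in "coins". */
typedef struct IoCostModel
{
    double  io_latency;     /* cost of reading one page */
    double  cost_per_item;  /* cost of processing one tuple */
} IoCostModel;

/*
 * Lookahead (in tuples) needed so that an I/O issued that far in
 * advance completes right before the page is actually needed.
 */
static int
target_lookahead(const IoCostModel *m)
{
    double  d = m->io_latency / m->cost_per_item;

    return (d < 1.0) ? 1 : (int) (d + 0.5);
}

int
main(void)
{
    /* the example from above: 100-coin reads, 1-coin tuples */
    IoCostModel m = {100.0, 1.0};

    printf("lookahead = %d tuples\n", target_lookahead(&m));

    /* a much faster device needs far less lookahead */
    m.io_latency = 5.0;
    printf("lookahead = %d tuples\n", target_lookahead(&m));
    return 0;
}

The hard part, of course, is where the two cost numbers would come from.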


regards

-- 
Tomas Vondra


