Hi,

On 2025-08-12 18:53:13 +0200, Tomas Vondra wrote:
> I'm running some tests looking for these weird changes, not just with
> the patches, but on master too. And I don't think b4212231 changed the
> situation very much.
>
> FWIW this issue is not caused by the index prefetching patches, I can
> reproduce it with master (on b227b0bb4e032e19b3679bedac820eba3ac0d1cf
> from yesterday). So maybe we should split this into a separate thread.
>
> Consider for example the dataset built by create.sql - it's randomly
> generated, but the idea is that it's correlated, but not perfectly. The
> table is ~3.7GB, and it's a cold run (caches dropped + restart).
>
> Anyway, a simple range query looks like this:
>
> EXPLAIN (ANALYZE, COSTS OFF)
> SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a ASC;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Index Scan using idx on t
> (actual time=0.584..433.208 rows=1048576.00 loops=1)
> Index Cond: ((a >= 16336) AND (a <= 49103))
> Index Searches: 1
> Buffers: shared hit=7435 read=50872
> I/O Timings: shared read=332.270
> Planning:
> Buffers: shared hit=78 read=23
> I/O Timings: shared read=2.254
> Planning Time: 3.364 ms
> Execution Time: 463.516 ms
> (10 rows)
>
> EXPLAIN (ANALYZE, COSTS OFF)
> SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a DESC;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Index Scan Backward using idx on t
> (actual time=0.566..22002.780 rows=1048576.00 loops=1)
> Index Cond: ((a >= 16336) AND (a <= 49103))
> Index Searches: 1
> Buffers: shared hit=36131 read=50872
> I/O Timings: shared read=21217.995
> Planning:
> Buffers: shared hit=82 read=23
> I/O Timings: shared read=2.375
> Planning Time: 3.478 ms
> Execution Time: 22231.755 ms
> (10 rows)
>
> That's a pretty massive difference ... this is on my laptop, and the
> timing changes quite a bit, but it's always a multiple of the first
> query with forward scan.

I suspect what you're mainly seeing here is that the OS can do readahead for
us for forward scans, but not for backward scans. Indeed, if I look at
iostat, the forward scan shows:

Device     r/s       rMB/s    rrqm/s  %rrqm  r_await  rareq-sz
nvme6n1    3352.00   400.89   0.00    0.00   0.18     122.47

           w/s       wMB/s    wrqm/s  %wrqm  w_await  wareq-sz
           0.00      0.00     0.00    0.00   0.00     0.00

           d/s       dMB/s    drqm/s  %drqm  d_await  dareq-sz
           0.00      0.00     0.00    0.00   0.00     0.00

           f/s       f_await  aqu-sz  %util
           0.00      0.00     0.62    47.90

whereas the backward scan shows:

Device     r/s       rMB/s    rrqm/s  %rrqm  r_await  rareq-sz
nvme6n1    10958.00  85.57    0.00    0.00   0.06     8.00

           w/s       wMB/s    wrqm/s  %wrqm  w_await  wareq-sz
           0.00      0.00     0.00    0.00   0.00     0.00

           d/s       dMB/s    drqm/s  %drqm  d_await  dareq-sz
           0.00      0.00     0.00    0.00   0.00     0.00

           f/s       f_await  aqu-sz  %util
           0.00      0.00     0.69    63.80

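Note the different read sizes (rareq-sz): ~122kB per request for the forward
scan vs. 8kB, i.e. a single block, for the backward scan.
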
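As a quick sanity check on the iostat figures: rareq-sz is just read
throughput divided by read request rate. A minimal sketch in Python, assuming
iostat's rMB/s means 1 MB = 1024 kB (the function name is made up for
illustration):

```python
def avg_read_kb(reads_per_sec, read_mb_per_sec):
    """Average read request size in kB, derived from iostat's r/s and rMB/s.

    Assumes 1 MB = 1024 kB, which is an assumption about iostat's units.
    """
    return read_mb_per_sec * 1024.0 / reads_per_sec


# r/s and rMB/s taken from the two iostat samples above
forward_kb = avg_read_kb(3352.00, 400.89)
backward_kb = avg_read_kb(10958.00, 85.57)

print(f"forward:  {forward_kb:.2f} kB per read")   # 122.47
print(f"backward: {backward_kb:.2f} kB per read")  # 8.00
```

So the backward scan issues roughly 3x as many requests per second but moves
under a quarter of the data, because every request is a single 8kB block.
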
> I did look into pg_aios, but there's only 8kB requests in both cases. I
> didn't have time to look closer yet.

That's what we'd expect, right? There's nothing on master that'd perform read
combining for index scans...

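The access pattern in question can be approximated outside of postgres. The
following is only a sketch, not a benchmark: it reads a file one 8kB block per
pread() call, either ascending or descending. The helper names and the temp
file are hypothetical stand-ins, and the 8MB file used here is far too small
and too warm to show any timing difference. To see one, the file would
presumably need to be much larger and read with cold caches.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def read_blocks(path, backward=False):
    """Read every block of `path` with one pread() each.

    Returns (total bytes read, elapsed seconds).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        nblocks = os.fstat(fd).st_size // BLOCK_SIZE
        order = range(nblocks - 1, -1, -1) if backward else range(nblocks)
        total = 0
        start = time.perf_counter()
        for blkno in order:
            total += len(os.pread(fd, BLOCK_SIZE, blkno * BLOCK_SIZE))
        return total, time.perf_counter() - start
    finally:
        os.close(fd)


def make_test_file(nblocks):
    """Create a throwaway file of `nblocks` blocks (hypothetical test data)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(nblocks * BLOCK_SIZE))
        return f.name


if __name__ == "__main__":
    path = make_test_file(1024)  # 8MB, illustration only
    try:
        for backward in (False, True):
            nbytes, secs = read_blocks(path, backward)
            label = "backward" if backward else "forward"
            print(f"{label}: {nbytes} bytes in {secs:.4f}s")
    finally:
        os.unlink(path)
```

Both directions issue the same number of 8kB reads. Any difference on a cold
file would come from whether the kernel recognizes the pattern as sequential
and reads ahead in larger requests.
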
Greetings,
Andres Freund