Hi,

On 2025-07-22 19:13:23 -0400, Peter Geoghegan wrote:
> On Tue, Jul 22, 2025 at 6:53 PM Andres Freund <and...@anarazel.de> wrote:
> > That may be true with local fast NVMe disks, but won't be true for networked
> > storage like in common clouds. Latencies of 0.3 - 4ms leave a lot of CPU
> > cycles for actual processing of the data.
>
> I don't understand why it wouldn't be a problem for NVMe disks, too.

> Take a range scan on pgbench_accounts_pkey, for example -- something
> like your ORDER BY ... LIMIT N test case, but with pgbench data
> instead of TPC-H data. There are 6 heap blocks per leaf page. As I
> understand it, the simple patch will only be able to see up to 6 heap
> blocks "into the future", at any given time. Why isn't that quite a
> significant drawback, regardless of the underlying storage?

My response was specific to Tomas' comment that for many queries, which tend
to be more complicated than the toys we are using here, there will be CPU
costs in the query.

E.g. on my local NVMe SSD I get about 55k IOPS with an iodepth of 6 (that's
without the stalls between leaf pages, so not really correct, but accounting
for those is too much math for me).  If each index leaf page references 6 heap
blocks, with 60 tuples on those heap pages, and you can get 55k IOPS with
that, you can fetch 20 million tuples / second. If per-tuple CPU processing
takes longer than 10**9/20_000_000 = 50 nanoseconds, you'll not be
bottlenecked on storage.
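To spell the back-of-envelope numbers out, here's a tiny Python sketch. The
inputs are the figures quoted above; the latency line just applies Little's
law to the measured IOPS and queue depth:

```python
# Figures from the paragraph above, taken as given:
IOPS = 55_000                 # measured on a local NVMe SSD at iodepth 6
IO_DEPTH = 6                  # one in-flight IO per referenced heap block
TUPLES_PER_SEC = 20_000_000   # quoted achievable tuple fetch rate

# Little's law: per-IO latency implied by the measured depth and rate.
avg_io_latency_us = IO_DEPTH / IOPS * 1_000_000
print(f"implied per-IO latency: {avg_io_latency_us:.1f} us")

# CPU budget per tuple: above this, the query is CPU-bound, not IO-bound.
per_tuple_budget_ns = 10**9 / TUPLES_PER_SEC
print(f"per-tuple CPU budget: {per_tuple_budget_ns:.0f} ns")
```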

E.g. for this silly query:
SELECT max(abalance)
FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);

while also using io_combine_limit=1 (to make the achieved IO depth directly
visible), I see an achieved IO depth of ~6.3 with the complex patch.

Whereas this even sillier query:
SELECT max(abalance), min(abalance), sum(abalance::numeric),
       avg(abalance::numeric), avg(aid::numeric), avg(bid::numeric)
FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);
only achieves an IO depth of ~4.1 with the complex patch.


                    cheaper query       expensive query
simple readahead    8723.209 ms         10615.232 ms
complex readahead   5069.438 ms          8018.347 ms

Obviously the CPU overhead in this example didn't completely eliminate the IO
bottleneck, but it sure reduced the difference.
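For what it's worth, the ratios implied by the table read like this (a quick
sketch over the timings quoted above):

```python
# Timings (ms) from the table above.
times_ms = {
    ("simple",  "cheap"):     8723.209,
    ("simple",  "expensive"): 10615.232,
    ("complex", "cheap"):     5069.438,
    ("complex", "expensive"): 8018.347,
}

# Speedup of complex over simple readahead, per query: the advantage
# shrinks as the query spends more CPU per tuple.
for query in ("cheap", "expensive"):
    speedup = times_ms[("simple", query)] / times_ms[("complex", query)]
    print(f"{query}: complex readahead is {speedup:.2f}x faster")
```

I.e. complex readahead buys ~1.72x on the cheap query but only ~1.32x on the
expensive one.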

If your assumption is that real queries are more CPU intensive than the toy
stuff above, e.g. due to joins etc., you can see why the actually attained IO
depth is lower.



Btw, something about the batching is off with the complex patch.  I was
wondering why I was not seeing 100% CPU usage while also not seeing very deep
queues - and I get deeper queues and better times with a lowered
INDEX_SCAN_MAX_BATCHES and worse with a higher one.

Greetings,

Andres Freund

