On Tue, Feb 17, 2026 at 2:27 PM Andres Freund <[email protected]> wrote:
> On 2026-02-17 12:16:23 -0500, Peter Geoghegan wrote:
> > On Mon, Feb 16, 2026 at 11:48 AM Andres Freund <[email protected]> wrote:
> > I agree that the current heuristics (which were invented recently) are
> > too conservative. I overfit the heuristics to my current set of
> > adversarial queries, as a stopgap measure.
>
> Are you doing any testing on higher latency storage?  I found it to be quite
> valuable to use dm_delay to have a disk with reproducible (i.e. not cloud)
> higher latency (i.e. not just a local SSD).

I sometimes use dm_delay (with the minimum 1ms delay) when testing,
but I don't do so regularly, mostly because it's inconvenient to do so
(perhaps not a great reason).
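
For anyone following along who wants to reproduce that kind of setup, a rough sketch of creating a dm-delay device follows. It requires root and the dm-delay kernel module; the device path and the 1ms delay are placeholders, not values from this thread:

```shell
# Wrap an existing block device in a device-mapper "delay" target that
# adds a fixed 1 ms of latency to every I/O. /dev/nvme1n1 is a placeholder.
DEV=/dev/nvme1n1
SECTORS=$(blockdev --getsz "$DEV")   # device size in 512-byte sectors

# dm-delay table format: <start> <length> delay <device> <offset> <delay_ms>
echo "0 $SECTORS delay $DEV 0 1" | dmsetup create delayed

# The delayed device appears as /dev/mapper/delayed; make a filesystem on
# it and point a tablespace (or the whole data directory) at it.
# Tear down afterwards with: dmsetup remove delayed
```

The appeal over cloud storage is reproducibility: the added latency is a fixed, known quantity rather than whatever the provider happens to deliver.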

> Low latency NVMe can reduce the
> penalty of not enough readahead so much that it's hard to spot problems...

I'll keep that in mind.

> > ISTM that we need the yields to better cooperate with whatever's
> > happening on the read stream side.
>
> Plausible.  It could be that we could get away with controlling the rampup to
> be slower in potentially problematic cases, without needing the yielding, but
> not sure.
>
> If that doesn't work, it might just be sufficient to increase the number of
> batches that trigger yields as the scan goes on (perhaps by taking the number
> of already "consumed" batches into account).

It could make sense to take the number of consumed batches into
account. In general, I think the best approach will be one that
combines multiple complementary strategies.

Passing down a LIMIT N hint has proven to be a good idea -- and it
doesn't really require applying any information related to the read
stream. That's enough to prevent problems in the really extreme cases
(e.g., nested loop antijoins with a LIMIT 1 on the inner side). The
problematic merge join I showed you is a not-so-extreme case, which
makes it trickier. ISTM that taking the number of "consumed" batches
into account will not help that particular merge join query, precisely
because it's not-so-extreme: the inner index scan consumes plenty of
batches, but is nevertheless significantly regressed (at least when we
don't yield at all).
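
To illustrate the "really extreme" shape I mean: a nested loop antijoin's inner index scan can stop after its first match, so any inner-side read-ahead beyond the first block is pure wasted I/O. A hypothetical sketch (the `orders`/`lineitem` names are made up for illustration, not from my benchmark schema):

```shell
# Hypothetical antijoin where the inner index scan behaves like LIMIT 1:
# it only ever needs the first matching tuple per outer row, so aggressive
# prefetching on the inner side is wasted work.
psql -c "
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM lineitem l
    WHERE l.l_orderkey = o.o_orderkey
);
"
```

The LIMIT hint makes that case easy to detect; the merge join case is harder precisely because no such hint applies.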

> To evaluate the amount of wasted work, it could be useful to make the read
> stream stats page spit out the amount of "unconsumed" IOs at the end of the
> scan.

That would make sense. You can already tell when that's happened by
comparing the details shown by EXPLAIN ANALYZE against the same query
execution on master, but that approach is inconvenient. Automating my
microbenchmarks has proven to be important with this project. There
are quite a few competing considerations, and it's too easy to improve
one query at the cost of regressing another.

> > The main motivation for yielding is to deal with things like merge
> > joins fed by at least one plain index scan, and plain scans for an
> > "ORDER BY .... LIMIT N" query.
>
> Would be good to document why the yielding exists more extensively in the
> comment above it...

I agree that the comments aren't fully worked out. Mostly because my
understanding of what's going on here isn't fully worked out.

> > I attach an example of where disabling the yield mechanism hurts
> > instead of helping, to give you a sense of the problems in this area.
>
> What data/schema is that? Looks kinda but not really TPC-H like.

The attached SQL script generates the same test data. There is a
dimension table, which might make it look similar to TPC-H, though
that wasn't really intentional.

> I assume that there are no mark & restores in the query, given that presumably
> the inner side is unique?

Right; this particular query doesn't use mark and restore.

I do have another test query that does use mark and restore (a
self-join + range conditions), but so far that doesn't seem to be too
much of a problem. We have to reset the read stream when we restore a
mark, which creates noticeable overhead. But (it seems) usually not
enough added overhead for it to really matter.

FWIW when the inner side of a merge join is an index-only scan, and we
have to mark + restore a lot, the patch is quite a lot faster than
master -- even when everything is cached. We don't have to repeatedly
do the same VM lookups on the inner side (we can just use our local VM
cache instead).

-- 
Peter Geoghegan

Attachment: generate_benchmark_data.sql