On Tue, Feb 17, 2026 at 2:27 PM Andres Freund <[email protected]> wrote:
> On 2026-02-17 12:16:23 -0500, Peter Geoghegan wrote:
> > On Mon, Feb 16, 2026 at 11:48 AM Andres Freund <[email protected]> wrote:
> > I agree that the current heuristics (which were invented recently) are
> > too conservative. I overfit the heuristics to my current set of
> > adversarial queries, as a stopgap measure.
>
> Are you doing any testing on higher latency storage? I found it to be quite
> valuable to use dm_delay to have a disk with reproducible (i.e. not cloud)
> higher latency (i.e. not just a local SSD).
I sometimes use dm_delay (with the minimum 1ms delay) when testing, but
don't do so regularly, just because it's inconvenient (perhaps not a
great reason).

> Low latency NVMe can reduce the
> penalty of not enough readahead so much that it's hard to spot problems...

I'll keep that in mind.

> > ISTM that we need the yields to better cooperate with whatever's
> > happening on the read stream side.
>
> Plausible. It could be that we could get away with controlling the rampup to
> be slower in potentially problematic cases, without needing the yielding, but
> not sure.
>
> If that doesn't work, it might just be sufficient to increase the number of
> batches that trigger yields as the scan goes on (perhaps by taking the number
> of already "consumed" batches into account).

It could make sense to take the number of consumed batches into
account. In general, I think the best approach will be one that
combines multiple complementary strategies.

Passing down a LIMIT N hint has proven to be a good idea -- and it
doesn't really require applying any information related to the read
stream. That's enough to prevent problems in the really extreme cases
(e.g., nested loop antijoins with a LIMIT 1 on the inner side). The
problematic merge join I showed you is a not-so-extreme case, which
makes it trickier. (I sketch both shapes below.)

ISTM that taking the number of "consumed" batches into consideration
will not help that particular merge join query, precisely because it's
not-so-extreme: the inner index scan consumes plenty of batches, but is
nevertheless significantly regressed (at least when we don't yield at
all).

> To evaluate the amount of wasted work, it could be useful to make the read
> stream stats page spit out the amount of "unconsumed" IOs at the end of the
> scan.

That would make sense. You can already tell when that's happened by
comparing the details shown by EXPLAIN ANALYZE against the same query
execution on master, but that approach is inconvenient.

Automating my microbenchmarks has proven to be important with this
project. There are quite a few competing considerations, and it's too
easy to improve one query at the cost of regressing another.

> > The main motivation for yielding is to deal with things like merge
> > joins fed by at least one plain index scan, and plain scans for an
> > "ORDER BY .... LIMIT N" query.
>
> Would be good to document why the yielding exists more extensively in the
> comment above it...

I agree that the comments aren't fully worked out, mostly because my
understanding of what's going on here isn't fully worked out.

> > I attach an example of where disabling the yield mechanism hurts
> > instead of helping, to give you a sense of the problems in this area.
>
> What data/schema is that? Looks kinda but not really TPC-H like.

The attached SQL script generates the same test data. There is a
dimension table, which might make it similar to TPC-H, though that
wasn't really intentional.

> I assume that there are no mark & restores in the query, given that presumably
> the inner side is unique?

Right; this particular query doesn't use mark and restore. I do have
another test query that does use mark and restore (a self-join + range
conditions), but so far that doesn't seem to be too much of a problem.
We have to reset the read stream when we restore a mark, which creates
noticeable overhead. But (it seems) it's usually not enough added
overhead to really matter.
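To be concrete, the kind of self-join I have in mind looks roughly like
this (a hypothetical sketch with made-up table and column names, not
the actual test query):

-- Hypothetical sketch: a self-join with a range condition that the
-- planner can execute as a merge join. Duplicate join key values on
-- the outer side force the inner scan to be restored to its mark once
-- per duplicate group, which is what exercises the read stream reset
-- path mentioned above.
CREATE TABLE events (id bigint PRIMARY KEY, event_ts timestamptz, payload text);
CREATE INDEX ON events (event_ts);

SET enable_hashjoin = off;
SET enable_nestloop = off;

SELECT o.id, i.id
FROM events o
JOIN events i ON i.event_ts = o.event_ts
WHERE o.event_ts >= '2026-01-01'
  AND o.event_ts < '2026-02-01'
  AND i.id <> o.id;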
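And for reference, rough sketches of the other query shapes discussed
above, again with made-up names rather than my actual benchmark
queries:

-- Hypothetical schema used only for these sketches.
CREATE TABLE dim (id bigint PRIMARY KEY, name text);
CREATE TABLE fact (id bigint PRIMARY KEY, dim_id bigint, amount numeric);
CREATE INDEX ON fact (dim_id);

-- Extreme case: a nested loop antijoin where the inner index scan can
-- stop after its first match, so a LIMIT 1 hint passed down to the
-- inner scan avoids prefetching heap pages that will never be read.
SELECT d.id
FROM dim d
WHERE NOT EXISTS (SELECT 1 FROM fact f WHERE f.dim_id = d.id);

-- Extreme case: a plain index scan feeding "ORDER BY ... LIMIT N";
-- only the first N matches are ever returned, so readahead past that
-- point is wasted work.
SELECT * FROM fact ORDER BY id LIMIT 10;

-- Not-so-extreme case: a merge join fed by a plain index scan on the
-- inner side. The inner scan consumes plenty of batches, but the join
-- might still end before the inner scan does, leaving some prefetched
-- IOs unconsumed.
SET enable_hashjoin = off;
SET enable_nestloop = off;

SELECT d.name, f.amount
FROM dim d
JOIN fact f ON f.dim_id = d.id
WHERE d.id <= 1000;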
FWIW when the inner side of a merge join is an index-only scan, and we
have to mark + restore a lot, the patch is quite a lot faster than
master -- even when everything is cached. We don't have to repeatedly
do the same VM lookups on the inner side (we can just use our local VM
cache instead).

--
Peter Geoghegan
generate_benchmark_data.sql
