On 1/5/26 13:21, Konstantin Knizhnik wrote:
>
> On 05/01/2026 2:21 AM, Tomas Vondra wrote:
>> Even a small fixed overhead can be quite visible for extremely short
>> queries that can't possibly benefit from prefetching (like a LIMIT 1).
>>
>> That's what we were concerned about when we invented this heuristic
>> (which only prefetches from the 2nd batch). I agree it's crude, no
>> argument there. I'm not against having something better, as long as we
>> can make it reliable.
>
> I completely agree that we should not create the read stream from the
> very beginning, despite the small overhead.
> My concern is that the second batch is not a good criterion: depending
> on the key size and the position of the first key on the page, it can
> be too short.
>
>
>> One of the problems with read_stream is that it may look ahead very
>> far, much further than the query will need. Imagine an index-only
>> scan, with 99.99% of pages being all-visible. At some point the scan
>> will hit an item that requires reading the heap page, which triggers
>> the look-ahead, and a search for the *next* heap page.
>>
>> But that next index entry may be many leaf pages ahead. And maybe the
>> scan never even gets to actually need it, which means the work will be
>> wasted. If the query only needs a couple of tuples, this may cause a
>> bad regression.
>
> Sorry, I do not completely understand how this can happen.
> The read stream can only fetch heap pages that are referenced by TIDs
> in leaf pages (leaf = batch).
> So the read stream cannot advance past the TIDs of the currently
> processed leaf page, can it?
>
Sure it can. It can look multiple leaf pages ahead. Allowing that is one
of the points of amgetbatch().
The read_stream gradually increases the look-ahead distance, so let's
say it increases the distance from 4 to 8. Now the callback needs to
find 4 more blocks to return to the read stream.
For plain index scans, we do skip duplicate blocks. If there's a
sequence of index entries pointing to tuples on the same heap page, the
callback will skip those. This can happen with correlated indexes. The
number of tuples on a heap page is limited, so there's a limit on how
bad this can get. But it still adds overhead, and forces the callback
to look further ahead.
For index-only scans it's much worse, because there's no limit on how
far away the next not-all-visible page is (and maybe there is none).
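To make the mechanics a bit more concrete, here's a simplified
standalone model of what the callback has to do. This is not the actual
PostgreSQL code; BatchItem, StreamPos and next_block are made-up names
for illustration only:

/*
 * Simplified standalone model (not actual PostgreSQL code) of the
 * filtering the read_stream callback has to do to find the next heap
 * block. All names here are made up for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct BatchItem
{
    BlockNumber heap_block;     /* heap page referenced by the index TID */
    bool        all_visible;    /* index-only scans: is the VM bit set? */
} BatchItem;

typedef struct StreamPos
{
    BatchItem  *items;          /* items from the current (and later) batches */
    int         nitems;
    int         next;           /* next item the callback will inspect */
    BlockNumber last_block;     /* last block returned, to skip duplicates */
} StreamPos;

/* Return the next heap block to read, or InvalidBlockNumber if none. */
static BlockNumber
next_block(StreamPos *pos, bool index_only)
{
    while (pos->next < pos->nitems)
    {
        BatchItem  *item = &pos->items[pos->next++];

        /* index-only scan: all-visible pages need no heap access at all */
        if (index_only && item->all_visible)
            continue;

        /* plain scan: skip runs of TIDs pointing at the same heap page */
        if (item->heap_block == pos->last_block)
            continue;

        pos->last_block = item->heap_block;
        return item->heap_block;
    }

    /* we may have walked the whole batch without finding anything */
    return InvalidBlockNumber;
}

int
main(void)
{
    /* correlated index: many TIDs per page, most pages all-visible */
    BatchItem   items[] = {
        {10, true}, {10, true}, {11, true}, {11, false}, {12, true}, {12, true}
    };
    StreamPos   pos = {items, 6, 0, InvalidBlockNumber};
    BlockNumber blk;

    while ((blk = next_block(&pos, true)) != InvalidBlockNumber)
        printf("read heap block %u\n", blk);    /* prints only block 11 */

    return 0;
}

The point is that a single call may have to walk an arbitrary number of
index entries (all-visible or duplicate ones) before it finds a block it
can hand to the stream.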
FWIW the "second batch" heuristic does not really fix these issues. The
unbounded look-ahead can happen for the later batches, and it's just as
bad. The "good" solution would be to have some feedback about how much
effort was already used (and then yield control, using the WIP patch).
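To illustrate what I mean by feedback on effort (this is only how I
picture it, not what the WIP patch actually does), the callback could
get a per-call work budget and bail out instead of scanning arbitrarily
far, reusing the next_block() model from the sketch above:

/*
 * Hypothetical sketch of a per-call work budget in the callback, using
 * the BatchItem/StreamPos model from the previous sketch. Not taken
 * from the actual patch; the budget value is made up.
 */
#define CALLBACK_WORK_BUDGET 64     /* max items inspected per call */

static BlockNumber
next_block_budgeted(StreamPos *pos, bool index_only)
{
    int         examined = 0;

    while (pos->next < pos->nitems && examined++ < CALLBACK_WORK_BUDGET)
    {
        BatchItem  *item = &pos->items[pos->next++];

        if (index_only && item->all_visible)
            continue;
        if (item->heap_block == pos->last_block)
            continue;

        pos->last_block = item->heap_block;
        return item->heap_block;
    }

    /*
     * Budget exhausted (or batch exhausted): return "nothing for now" so
     * the stream stops growing its look-ahead distance and the scan can
     * make progress before we try again.
     */
    return InvalidBlockNumber;
}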
>> IIRC the same thing can happen for plain index scans. You may need to
>> try with correlated indexes (I think the read_stream callback skips
>> duplicate TIDs).
>
> Another problem is that different TIDs can refer to the same heap page
> (which happens with a clustered index or a table populated in
> key-ascending order).
Sure, but master does not check that either. It's not clear to me what
exactly we should do about that. Should we hold on to a pin for a couple
of recent pages? Or should we just remember the recent blocks, assuming
they're still in cache? Or something else?
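For the "remember the recent blocks" option I imagine something as
simple as a tiny ring of recently returned block numbers. This is purely
hypothetical, not from any existing patch, and all names and sizes are
made up:

/*
 * Hypothetical sketch: a tiny ring of recently returned heap blocks,
 * used to avoid handing the same (presumably still cached) block to the
 * read stream again.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;
#define RECENT_BLOCKS 8

typedef struct RecentBlocks
{
    BlockNumber blocks[RECENT_BLOCKS];
    int         next;               /* next slot to overwrite */
} RecentBlocks;

/* Returns true if we returned this block recently; remembers it if not. */
static bool
recently_returned(RecentBlocks *rb, BlockNumber blk)
{
    for (int i = 0; i < RECENT_BLOCKS; i++)
        if (rb->blocks[i] == blk)
            return true;

    rb->blocks[rb->next] = blk;
    rb->next = (rb->next + 1) % RECENT_BLOCKS;
    return false;
}

int
main(void)
{
    RecentBlocks rb = {{0}, 0};

    printf("%d\n", recently_returned(&rb, 42));     /* 0: not seen yet */
    printf("%d\n", recently_returned(&rb, 42));     /* 1: skip, likely cached */
    return 0;
}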
> But the number-of-SMGR-reads criterion should also work well in this
> case.
>
It might be one of the inputs for the heuristic. But I still don't know
what exactly the heuristic should look like.
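In other words the decision could combine counters the scan already has
(items returned, batches consumed, reads that missed shared buffers),
along the lines of this hypothetical sketch. The names and thresholds
are made up for illustration:

/*
 * Hypothetical sketch of a combined heuristic for enabling the read
 * stream. All names and thresholds here are made up.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct ScanStats
{
    uint64_t    items_returned;     /* index entries already processed */
    uint64_t    batches_consumed;   /* leaf pages / batches already read */
    uint64_t    smgr_reads;         /* heap reads that missed shared buffers */
} ScanStats;

static bool
should_enable_prefetch(const ScanStats *s)
{
    /* don't pay the setup cost for trivially short scans (e.g. LIMIT 1) */
    if (s->items_returned < 10)
        return false;

    /* the existing rule: once we get to the second batch, enable it */
    if (s->batches_consumed >= 2)
        return true;

    /* otherwise enable it only once the scan has done real reads */
    return (s->smgr_reads > 0);
}

int
main(void)
{
    ScanStats   s = {5, 1, 0};      /* short scan, all buffer hits */

    printf("%d\n", should_enable_prefetch(&s));     /* 0 */

    s = (ScanStats) {200, 1, 40};   /* first batch, but doing real reads */
    printf("%d\n", should_enable_prefetch(&s));     /* 1 */

    return 0;
}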
>>> When all data is cached in shared buffers, we do not perform any IO
>>> at all.
>>> It means it doesn't matter whether and when we initialize the
>>> read_stream.
>>> We can do it after processing 10 items (the current default), or
>>> immediately - it should not affect performance.
>>> And this is what I have tested: performance does not actually depend
>>> on `read_stream_threshold` (if the data fits in shared buffers).
>>> At least it is within a few percent, and maybe that is just random
>>> fluctuation.
>>> Obviously there is no 25% degradation.
>>
>> I'm not sure it's this simple. Even if we perform no physical I/O, the
>> read_stream callback will need to go through the whole batch and pass
>> all the items to the read_stream. The data may be cached, but not in
>> shared buffers, in which case the read_stream will do the IPC etc.
>
> You mean that the data is present in the OS file cache, so reading it
> is fast and will not benefit from AIO?
> It certainly can happen, but it seems to be a rather obvious problem
> of double buffering.
>
Depends. It can benefit from AIO with io_method=worker, because it can
parallelize the memcpy() and checksum validation.
My point is that the "page cache" is pretty opaque to us. We don't know
if a page was read from disk or page cache, which means we can't really
calculate cache hit ratio for it.
>>> It definitely doesn't mean that it is not possible to find a
>>> scenario where this approach of enabling prefetch after processing N
>>> items will show worse performance than master or v5. We just need to
>>> properly choose the cache hit rate. But the same is true IMHO for v5
>>> itself: it is possible to find a workload where it will show the
>>> same degradation compared with master.
>>
>> I'm sure such "adversarial" data sets exist, even if we don't have a
>> good example at hand. The question is how plausible / realistic such
>> data sets are, though.
>>
>> I'm not sure about v5, though. The assumption was the overhead is most
>> visible for short queries, and once we get to the 2nd batch the query is
>> expensive enough to not be affected too much.
>>
>> When you say "we just need to properly choose cache hit rate", how would
>> we do that? And how would we know/determine the optimal threshold?
>
> Sorry, maybe I was unclear.
> By "choosing the cache hit rate" I didn't mean using it to determine
> the proper threshold.
> I was thinking about the best/worst scenarios for index prefetch.
> Originally in my "benchmark" I considered the case where no heap pages
> are present in shared buffers.
> In your scenario the size of shared buffers is larger than the size of
> the table, so all pages are present in shared buffers.
> The first case is certainly the best scenario for index prefetch,
> where we can expect the largest improvement.
> In your case there is certainly no improvement, but as we can see, the
> overhead is also not so large (just because we do not read any pages).
> I expect that the worst result will be in some intermediate case, when
> some pages are present in shared buffers and some are not.
> It will be especially true for the "second batch" criterion, because
> it can happen that we need to read just a single heap page for the
> whole batch, and using the read stream for it will just add extra
> overhead.
>
True. The case with no data in memory is definitely the best case for
index prefetching. With all data in shared buffers we can measure the
overhead of read_stream (because it won't really do any prefetching).
The in-between cases with a mix of cached / uncached pages can trigger
all kinds of weird behaviors in read_stream / AIO. We've seen cases
where the distance "collapsed" to ~2.0, which maximizes the overhead
with io_method=worker (doing IPC for each block).
You could argue that's more an issue in read_stream, of course. And in
my last round of testing I didn't see such cases. But I don't think it
got fixed; more likely I just haven't generated a suitable data set.
>>> A more precise heuristic should IMHO take into account the actual
>>> number of performed disk reads.
>> I'm not sure "disk reads" is the right term. The data may be in page
>> cache, and we just need to copy them into shared buffers.
>
> I agree, but there seems to be no way to determine whether a page was
> actually read from disk or from the cache.
> But as I wrote above, it is a problem of double buffering. If we can
> avoid it (i.e. by using direct IO), then there will be no such
> problem.
>
True, but most users don't use direct I/O.
>>> Please notice that I do not want to predict the number of disk
>>> reads, i.e. check whether candidates for prefetch are present in
>>> shared buffers.
>>> That would really add significant overhead. I think that it is
>>> better to use the number of performed reads as the threshold.
>>>
>> Not sure I understand what this says. Can you elaborate?
>
> I sent the proposed patch in the previous mail.
> Did you have a chance to look at it?
>
Not yet, I'll take a look.
regards
--
Tomas Vondra