On Tue, Jul 22, 2025 at 8:37 PM Tomas Vondra <to...@vondra.me> wrote:
> > I happen to think that that's a very unrealistic assumption. Most
> > standard benchmarks have indexes that almost all look fairly similar
> > to pgbench_accounts_pkey, from the point of view of "heap page blocks
> > per leaf page". There are exceptions, of course (e.g., the TPC-C order
> > table's primary key suffers from fragmentation).
>
> I agree with all of this.

Cool.

> I assume you mean results for the "linear" data set, because for every
> other data set the patches perform almost exactly the same (when
> restoring the distance after stream reset):
>
> https://github.com/tvondra/indexscan-prefetch-tests/blob/with-distance-restore-after-reset/d16-rows-cold-32GB-16-unscaled.pdf

Right.

> And it's a very good point. I was puzzled by this too for a while, and
> it took me a while to understand how/why this happens. It pretty much
> boils down to the "duplicate block" detection and how it interacts with
> the stream resets (again!).

I think that you slightly misunderstand where I'm coming from here: it
*doesn't* puzzle me. What puzzled me was that it puzzled you.

Andres' test query is very simple, and not entirely sympathetic towards
the complex patch (by design). And yet it *also* gets quite a decent
improvement from the complex patch. It doesn't speed things up by
another order of magnitude or anything, but it's a very decent
improvement -- one well worth having.

I'm also unsurprised that all the other tests that you ran were more or
less a draw between simple and complex -- at least now that I've
drilled down and understood what the indexes from those other test
cases actually look like, in practice.

> So you're right the complex patch prefetches far ahead. I thought the
> distance would quickly decrease because of the duplicate blocks, but I
> missed the fact that the read stream will not see them at all.

FWIW I wasn't thinking about it at anything like that level of
sophistication. Everything I've said about it was based on intuitions
about how the prefetching was bound to work, for each different kind of
index. I just looked at individual leaf pages (or small groups of them)
from each index/test, and considered their TIDs, and imagined how that
was likely to affect the scan.
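(To make "considered their TIDs" concrete: pageinspect can show the
heap TIDs on any leaf page of interest. A rough sketch of the kind of
query I have in mind -- it assumes a unique index with no posting list
tuples, picks leaf block 1 purely as an example, and ignores the high
key item that appears first on non-rightmost pages:

  CREATE EXTENSION IF NOT EXISTS pageinspect;

  -- count heap TIDs vs. distinct heap blocks on one leaf page
  SELECT count(ctid) AS nhtids,
         count(DISTINCT (ctid::text::point)[0]::int) AS nhblks
  FROM bt_page_items('pgbench_accounts_pkey', 1);

An index like pgbench_accounts_pkey shows nhblks far below nhtids --
each leaf page points to just a handful of heap blocks, which is
exactly what makes eagerly reading extra leaf pages attractive there.)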
It just seems obvious to me that all the tests (except for "linear")
couldn't possibly be helped by eagerly reading multiple leaf pages. It
seemed equally obvious that it's quite possible to come up with a test
suite where several tests (not just 1) could benefit in the same way.
Although your "linear_1"/"linear_N" tests aren't actually like that,
many cases will be -- and not just those that are perfectly correlated,
a la pgbench.

> I'm not sure it's desirable to "hide" blocks from the read stream like
> this - it'll never see the misses. How could it make good decisions,
> when we skew the data used by the heuristics like this?

I don't think that I fully understand what's desirable here myself.

> > Doing *no* prefetching will usually be the right thing to do. Does
> > that make index prefetching pointless in general?
>
> I don't think so. Why would it? There's plenty of queries that can
> benefit from it a lot, and as long as it doesn't cause harm to other
> queries it's a win.

I was being sarcastic. That wasn't a useful thing for me to do.
Apologies.

> This is not resetting the stream, though. This is resetting the
> position tracking how far the stream got.

My main point is that there's stuff going on here that nobody quite
understands just yet. And so it probably makes sense to defensively
assume that the prefetch distance resetting stuff might matter with
either the complex or the simple patch.

> Sorry, got distracted and forgot to complete the sentence. I think I
> wanted to write "mostly from not resetting the distance to 1". Which
> is true, but the earlier "linear" example also shows there are cases
> where the page boundaries are significant.

Of course that's true. But that was just a temporary defect of the
"simple" patch (and perhaps even of the "complex" patch, albeit to a
much lesser degree). It isn't really relevant to the important question
of whether the simple or complex design should be pursued -- we know
that now.

As I said, I don't think that the test suite is particularly well
suited to evaluating simple vs complex. There's only one test
("linear") that has any hope of being better with the complex patch,
and having only 1 such test isn't representative.

> That's actually intentional. I wanted to model tables with wider
> tuples, without having to generate all the data etc. Maybe 25% is too
> much, and real tables have more than 20 tuples. It's true 400B is
> fairly large.

My point about fill factor isn't particularly important.

> I'm not against testing with other parameters, of course. The test was
> not originally written for comparing different prefetching patches, so
> it may not be quite fair (and I'm not sure how to define "fair").

I'd like to see more than 1 test where eagerly reading leaf pages has
any hope of helping. That's my only important concern.

> It's not uniformly random, I wrote it uses normal distribution. The
> query in the SQL script does this:
>
>   select x + random_normal(0, 1000) from ...
>
> It is a synthetic test data set, of course. It's meant to be simple to
> generate and reason about, and somewhere in between the "linear" and
> "uniform" data sets.
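To make sure we're talking about the same thing, here's a minimal
self-contained version of that kind of generator (the table name,
column name, and row count are mine, not the script's; random_normal()
requires Postgres 16 or later):

  CREATE TABLE noisy_linear AS
  SELECT x + random_normal(0, 1000) AS val
  FROM generate_series(1, 1000000) AS x;

  CREATE INDEX ON noisy_linear (val);

Note the scale of the noise: with a stddev of 1000, and roughly 20 heap
tuples per page, a typical value lands tens of heap pages away from its
"clean" linear position.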
I always start by looking at the index leaf pages, and imagining how an
index scan can/will deal with that. Just because the data isn't truly
uniformly random doesn't mean that that's apparent when you just look
at one leaf page -- heap blocks might very well *appear* to be
uniformly random (or close to it) when you drill down like that. Or
even when you look at (say) 50 neighboring leaf pages.

> But it also has realistic motivation - real tables are usually not as
> clean as "linear", nor as random as the "uniform" data sets (not for
> all columns, at least). If you're looking at data sets like "orders"
> or whatever, there's usually a bit of noise even for columns like
> "date" etc. People modify the orders, or fill in data from a couple of
> days ago, etc. Perfect correlation for one column implies slightly
> worse correlation for another column (order date vs. delivery date).

I agree.

> Right. I don't see a problem with this. I'm not saying parameters for
> this particular data set are "perfect", but the intent is to have a
> range of data sets from "perfectly clean" to "random" and see how the
> patch(es) behave on all of them.

Obviously none of your test cases are invalid -- they're all basically
reasonable, when considered in isolation. But the "linear_1" test is
*far* closer to the "uniform" test than it is to the "linear" test, at
least as far as the simple vs complex question is concerned.

> If you have a suggestion for different data sets, or how to tweak the
> parameters to make it more realistic, I'm happy to try those.

I'll get back to you on this soon. There are plenty of indexes that are
not perfectly correlated (the way pgbench_accounts_pkey is) that'll
nevertheless benefit significantly from the approach taken by the
complex patch. I'm sure of this because I've been using the query I
posted earlier for many years now -- I've thought about and directly
instrumented the "nhtids:nhblks" of an index of interest many times in
the past.

Thanks

--
Peter Geoghegan