On Fri, Mar 1, 2024 at 10:18 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> But I have very hard time figuring out what the MVP version should be,
> because I have very limited understanding on how much control the index
> AM ought to have :-( And it'd be a bit silly to do something in v17,
> only to have to rip it out in v18 because it turned out to not get the
> split right.

I suspect that you're overestimating the difficulty of getting the
layering right (at least relative to the difficulty of everything
else).

The executor proper doesn't know anything about pins on leaf pages
(and in reality nbtree usually doesn't hold any pins these days). All
the executor knows is that it had better not be possible for an
in-flight index scan to get confused by concurrent TID recycling by
VACUUM. When amgettuple/btgettuple is called, nbtree usually just
returns TIDs it collected from a just-scanned leaf page.
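
To make that concrete, here's a greatly simplified sketch of the
amgettuple flow in nbtree (the real btgettuple also does
kill_prior_tuple bookkeeping and a few other things; treat this as a
paraphrase, not actual code):

#include "postgres.h"
#include "access/nbtree.h"

/* Simplified paraphrase of the btgettuple flow, for illustration only */
static bool
btgettuple_sketch(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;

    if (!BTScanPosIsValid(so->currPos))
        return _bt_first(scan, dir);    /* descend the tree, fill currPos
                                         * with TIDs from one leaf page */

    /*
     * currPos already caches the TIDs collected from the most recently
     * scanned leaf page; _bt_next just advances currPos.itemIndex, and
     * only steps to the sibling leaf page once that cache is exhausted.
     */
    return _bt_next(scan, dir);
}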

This sort of stuff already lives in the index AM. It seems to me that
everything at the API and executor level can continue to work in
essentially the same way as it always has, with only minimal revision
to the wording around buffer pins (in fact that really should have
happened back in 2015, as part of commit 2ed5b87f).  The hard part
will be figuring out how to make the physical index scan prefetch
optimally, in a way that balances various considerations. These
include:

* Managing heap prefetch distance (a toy ramp-up sketch follows this
list).

* Avoiding making kill_prior_tuple significantly less effective
(perhaps the new design could even make it more effective, in some
scenarios, by holding onto multiple buffer pins based on a dynamic
model).

* Figuring out how many leaf pages it makes sense to read ahead of
accessing the heap, since there is no fixed relationship between the
number of leaf pages we scan and the number of distinct heap blocks
they yield for prefetching. (This is made more complicated by things
like LIMIT, but that's actually an independent problem.)
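
To make the first point a bit more concrete, here's a minimal, purely
illustrative sketch of the sort of gradual ramp-up I have in mind (the
function name and shape are made up, nothing from an actual patch);
bitmap heap scan already grows its prefetch target in roughly this way:

#include "postgres.h"

/*
 * Hypothetical helper: grow the heap prefetch distance gradually, so a
 * scan that stops early (say, under a small LIMIT) never pays for a big
 * initial burst of prefetches, while a long scan quickly reaches the cap.
 */
static int
grow_prefetch_distance(int current_distance, int max_distance)
{
    if (current_distance == 0)
        return 1;               /* issue a single prefetch at first */
    /* double on each step until we hit the configured maximum */
    return Min(current_distance * 2, max_distance);
}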

So I think that you need to teach index AMs to behave roughly as if
multiple leaf pages were read as one single leaf page, at least in
terms of things like how the BTScanOpaqueData.currPos state is
managed. I imagine that currPos will need to be filled with TIDs from
multiple index pages, instead of just one, with entries that are
organized in a way that preserves the illusion of one continuous scan
from the point of view of the executor proper. By the time we actually
start returning TIDs via btgettuple, it looks like we scanned
one giant leaf page instead of several (the exact number of leaf pages
scanned will probably have to be indeterminate, because it'll depend
on things like heap prefetch distance).
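
Just to illustrate the shape of the thing (this is a made-up sketch,
with an invented struct name and an invented MAX_BATCH_PAGES cap, not
what the data structure would actually end up looking like):

#include "postgres.h"
#include "access/nbtree.h"

#define MAX_BATCH_PAGES 8       /* invented cap; would really be dynamic */

/*
 * Sketch of a currPos-like structure that holds the TIDs from several
 * leaf pages at once, while still presenting them to btgettuple as one
 * continuous stream of items.  (A real design would also need to map
 * each item back to the leaf page it came from, for _bt_killitems.)
 */
typedef struct BTBatchScanPosData
{
    /* items gathered from one or more leaf pages, in scan order */
    BTScanPosItem items[MaxTIDsPerBTreePage * MAX_BATCH_PAGES];
    int         firstItem;      /* first valid entry in items[] */
    int         lastItem;       /* last valid entry in items[] */
    int         itemIndex;      /* next entry to return to the executor */

    /* per-page bookkeeping */
    int         npages;         /* number of leaf pages represented */
    BlockNumber pages[MAX_BATCH_PAGES]; /* their block numbers */
    BlockNumber nextPage;       /* right sibling of the last page read */
} BTBatchScanPosData;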

The good news (assuming that I'm right here) is that you don't need to
have specific answers to most of these questions in order to commit a
v1 of index prefetching. ISTM that all you really need is to have
confidence that the general approach that I've outlined is the right
approach, long term (certainly not nothing, but I'm at least
reasonably confident here).

> The hard thing is what to do about cases where neither of this helps.
> The example I keep thinking about is IOS - if we don't do prefetching,
> it's not hard to construct cases where regular index scan gets much
> faster than IOS (with many not-all-visible pages). But we can't just
> prefetch all pages, because that'd hurt IOS cases with most pages fully
> visible (when we don't need to actually access the heap).
>
> I managed to deal with this in the executor-level version, but I'm not
> sure how to do this if the control moves closer to the index AM.

The reality is that nbtree already knows about index-only scans. It
has to, because it wouldn't be safe to drop the pin on a leaf page's
buffer when the scan is "between pages" in the specific case of
index-only scans (so the _bt_killitems code path used when
kill_prior_tuple has index tuples to kill knows about index-only
scans).
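
For reference, the special case in question looks roughly like this
today (paraphrased from nbtsearch.c from memory, so don't take the
details as gospel); xs_want_itup is what marks a scan as index-only:

static void
_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
    _bt_unlockbuf(scan->indexRelation, sp->buf);

    if (IsMVCCSnapshot(scan->xs_snapshot) &&
        RelationNeedsWAL(scan->indexRelation) &&
        !scan->xs_want_itup)
    {
        /*
         * Plain MVCC scans can let go of the pin here.  The IOS exception
         * exists because an index-only scan may never visit the heap at
         * all (when the VM says all-visible), so the pin is what stops
         * VACUUM from recycling these TIDs out from under it.
         */
        ReleaseBuffer(sp->buf);
        sp->buf = InvalidBuffer;
    }
}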

I actually added commentary to the nbtree README that goes into TID
recycling by VACUUM not too long ago. This includes stuff about how
LP_UNUSED items in the heap are considered dead to all index scans
(which can actually try to look at a TID that just became LP_UNUSED in
the heap!), even though LP_UNUSED items don't prevent VACUUM from
setting heap pages all-visible. This seemed like the only way of
explaining the _bt_killitems IOS issue that actually made sense.

What you really want to do here is to balance costs and benefits.
That's just what's required. The fact that those costs and benefits
span multiple levels of abstractions makes it a bit awkward, but
doesn't (and can't) change the basic shape of the problem.

--
Peter Geoghegan

