On Fri, Mar 1, 2024 at 10:18 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> But I have a very hard time figuring out what the MVP version should be,
> because I have a very limited understanding of how much control the index
> AM ought to have :-( And it'd be a bit silly to do something in v17,
> only to have to rip it out in v18 because it turned out to not get the
> split right.

I suspect that you're overestimating the difficulty of getting the
layering right (at least relative to the difficulty of everything
else). The executor proper doesn't know anything about pins on leaf
pages (and in reality nbtree usually doesn't hold any pins these
days). All the executor knows is that it had better not be possible
for an in-flight index scan to get confused by concurrent TID
recycling by VACUUM. When amgettuple/btgettuple is called, nbtree
usually just returns TIDs it collected from a just-scanned leaf page.
This sort of stuff already lives in the index AM.

It seems to me that everything at the API and executor level can
continue to work in essentially the same way as it always has, with
only minimal revision to the wording around buffer pins (in fact that
really should have happened back in 2015, as part of commit
2ed5b87f). The hard part will be figuring out how to make the
physical index scan prefetch optimally, in a way that balances
various considerations. These include:

* Managing heap prefetch distance (a toy sketch of one approach
follows this list).

* Avoiding making kill_prior_tuple significantly less effective
(perhaps the new design could even make it more effective, in some
scenarios, by holding onto multiple buffer pins based on a dynamic
model).

* Figuring out how many leaf pages it makes sense to read ahead of
accessing the heap, since there is no fixed relationship between the
number of leaf pages we need to scan and the number of distinct heap
blocks that we collect for prefetching. (This is made more
complicated by things like LIMIT, but is actually an independent
problem.)
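To make the first of those points a bit more concrete, here is a toy
sketch of the sort of distance management I have in mind. To be
clear, this isn't real nbtree code: the helpers
scan_next_heap_block() and heap_prefetch_block() are made-up
stand-ins for "peek at the next distinct heap block among the TIDs
we've buffered" and "call PrefetchBuffer() for that block of the heap
relation", and the ramp-up policy is just the simplest thing that
shows the idea.

#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define PREFETCH_DISTANCE_MAX 64    /* made-up cap */

/* hypothetical: next distinct heap block among buffered TIDs */
extern BlockNumber scan_next_heap_block(void);
/* hypothetical: wrapper that calls PrefetchBuffer() on the heap rel */
extern void heap_prefetch_block(BlockNumber blkno);

typedef struct PrefetchState
{
    int         distance;   /* current target prefetch distance */
    int         inflight;   /* prefetches issued, not yet consumed */
} PrefetchState;

static void
maintain_prefetch_distance(PrefetchState *ps)
{
    /*
     * Ramp the distance up gradually, so that a LIMIT query that
     * stops after a handful of tuples doesn't pay for prefetches that
     * it never gets to consume.
     */
    if (ps->distance < PREFETCH_DISTANCE_MAX)
        ps->distance++;

    /* Stay "distance" distinct heap blocks ahead of the scan proper */
    while (ps->inflight < ps->distance)
    {
        BlockNumber blkno = scan_next_heap_block();

        if (blkno == InvalidBlockNumber)
            break;          /* nothing further buffered to look at */
        heap_prefetch_block(blkno);
        ps->inflight++;
    }
}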
So I think that you need to teach index AMs to behave roughly as if
multiple leaf pages were read as one single leaf page, at least in
terms of things like how the BTScanOpaqueData.currPos state is
managed. I imagine that currPos will need to be filled with TIDs from
multiple index pages, instead of just one, with entries that are
organized in a way that preserves the illusion of one continuous scan
from the point of view of the executor proper. By the time we
actually start returning TIDs via btgettuple, it looks like we
scanned one giant leaf page instead of several (the exact number of
leaf pages scanned will probably have to be indeterminate, because
it'll depend on things like heap prefetch distance).
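In code terms, the shape that I'm imagining is something like the
following. This is only a sketch of the idea, not a patch: the
simplified typedefs and the MAX_BATCH_* sizing are stand-ins for the
real nbtree definitions, and only the firstItem/lastItem/itemIndex
convention is carried over from today's BTScanPosData.

#include <stdint.h>

typedef int Buffer;             /* simplified stand-in */
typedef uint16_t OffsetNumber;  /* simplified stand-in */

typedef struct ItemPointerData  /* simplified heap TID */
{
    uint32_t     ip_blkid;
    OffsetNumber ip_posid;
} ItemPointerData;

#define MAX_BATCH_PAGES 16      /* hypothetical */
/* 1358 is roughly MaxTIDsPerBTreePage with 8K blocks */
#define MAX_BATCH_ITEMS (MAX_BATCH_PAGES * 1358)

typedef struct BTScanPosItem    /* one heap TID, much as today */
{
    ItemPointerData heapTid;
    OffsetNumber indexOffset;   /* index item's location within page */
} BTScanPosItem;

typedef struct BTScanPosData
{
    /*
     * Zero or more leaf page pins instead of exactly one, so that
     * _bt_killitems can still get back to the page that each TID came
     * from (how many pins to actually hold onto could be decided by a
     * dynamic model, per the kill_prior_tuple point above).
     */
    Buffer      bufs[MAX_BATCH_PAGES];
    int         npages;

    /*
     * TIDs from all read-ahead leaf pages, concatenated in scan
     * order. btgettuple just advances itemIndex from firstItem to
     * lastItem, so the executor proper sees what looks like one
     * continuous scan of a single giant leaf page.
     */
    BTScanPosItem items[MAX_BATCH_ITEMS];
    int         firstItem;      /* first valid index in items[] */
    int         lastItem;       /* last valid index in items[] */
    int         itemIndex;      /* current index within items[] */
} BTScanPosData;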
The good news (assuming that I'm right here) is that you don't need
to have specific answers to most of these questions in order to
commit a v1 of index prefetching. ISTM that all you really need is to
have confidence that the general approach that I've outlined is the
right approach, long term (certainly not nothing, but I'm at least
reasonably confident here).

> The hard thing is what to do about cases where neither of these helps.
> The example I keep thinking about is IOS - if we don't do prefetching,
> it's not hard to construct cases where a regular index scan gets much
> faster than IOS (with many not-all-visible pages). But we can't just
> prefetch all pages, because that'd hurt IOS cases with most pages fully
> visible (when we don't need to actually access the heap).
>
> I managed to deal with this in the executor-level version, but I'm not
> sure how to do this if the control moves closer to the index AM.

The reality is that nbtree already knows about index-only scans. It
has to, because it wouldn't be safe to drop the pin on a leaf page's
buffer when the scan is "between pages" in the specific case of
index-only scans (so the _bt_killitems code path used when
kill_prior_tuple has index tuples to kill knows about index-only
scans).

I actually added commentary to the nbtree README that goes into TID
recycling by VACUUM not too long ago. This includes stuff about how
LP_UNUSED items in the heap are considered dead to all index scans
(which can actually try to look at a TID that just became LP_UNUSED
in the heap!), even though LP_UNUSED items don't prevent VACUUM from
setting heap pages all-visible. This seemed like the only way of
explaining the _bt_killitems IOS issue that actually made sense.

What you really want to do here is to balance costs and benefits.
That's just what's required. The fact that those costs and benefits
span multiple levels of abstraction makes it a bit awkward, but
doesn't (and can't) change the basic shape of the problem.

--
Peter Geoghegan