Hi,

On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.
I'm really excited about this work.

> 1) pairing-heap in GiST / SP-GiST
>
> For most AMs, the index state is pretty trivial - matching items from a
> single leaf page. Prefetching that is pretty trivial, even if the
> current API is a bit cumbersome.
>
> Distance queries on GiST and SP-GiST are a problem, though, because
> those do not just read the pointers into a simple array, as the distance
> ordering requires passing stuff through a pairing-heap :-(
>
> I don't know how to best deal with that, especially not in the simple
> API. I don't think we can "scan forward" stuff from the pairing heap, so
> the only idea I have is actually having two pairing-heaps. Or maybe
> using the pairing heap for prefetching, but stashing the prefetched
> pointers into an array and then returning stuff from it.
>
> In the patch I simply prefetch items before we add them to the pairing
> heap, which is good enough for demonstrating the benefits.

I think it'd be perfectly fair to just not tackle distance queries for now.

> 2) prefetching from executor
>
> Another question is whether the prefetching shouldn't actually happen
> even higher - in the executor. That's what Andres suggested during the
> unconference, and it kinda makes sense. That's where we do prefetching
> for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering about is whether we should not have split
the code for IOS and plain index scans...

> 4) per-leaf prefetching
>
> The code is restricted to only prefetch items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
>
> I think this is an acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with.
If we move the responsibility more towards the executor, I think it very
well could end up being just as simple to prefetch across index pages.

> 5) index-only scans
>
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access those. And if that happens, this leads to the bizarre situation
> that IOS is slower than a regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.

That should be easy to do, right?

> Benchmark / TPC-H
> -----------------
>
> I ran the 22 queries on a 100GB data set, with parallel query either
> disabled or enabled, and measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
>
> query    serial    parallel
> 1         101%       99%
> 2         119%      100%
> 3         100%       99%
> 4         101%      100%
> 5         101%      100%
> 6          12%       99%
> 7         100%      100%
> 8          52%       67%
> 10        102%      101%
> 11        100%       72%
> 12        101%      100%
> 13        100%      101%
> 14         13%      100%
> 15        101%      100%
> 16         99%       99%
> 17         95%      101%
> 18        101%      106%
> 19         30%       40%
> 20         99%      100%
> 21        101%      100%
> 22        101%      107%
>
> The percentage is (timing patched / master), so <100% means faster and
> >100% means slower.
>
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
>
> My explanation is that either (a) the parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers.
> Or maybe both.
>
> There are a couple regressions too. I believe those are due to doing too
> much prefetching in some cases, and some of the heuristics mentioned
> earlier should eliminate most of this, I think.

I'm a bit confused by some of these numbers. How can OS-level prefetching
lead to massive gains in the already cached case, e.g. in TPC-H q06 and
q08? Unless I missed what "xeon / cached (speedup)" indicates?

I think it'd be good to run a performance comparison of the unpatched vs.
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).

Greetings,

Andres Freund