On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <to...@vondra.me> wrote:
> I've been looking at this patch over the last couple of days, mostly
> doing some stress testing / benchmarking (hence the earlier report) and
> basic review.
Thanks for taking a look! Very helpful.

> I do have some initial review comments, and the testing produced
> some interesting regressions (not sure if those are the cases where
> skipscan can't really help, that Peter mentioned he needs to look into).

The one type of query that's clearly regressed in a way that's just not acceptable is the kind where we waste CPU cycles during scans where skipping is truly hopeless. For example, I see a big regression on one of the best cases for the Postgres 17 work, described here:

https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans#a-practical-example-3x-performance-improvement

Notably, these cases access exactly the same buffers/pages as before, so this really isn't a matter of "doing too much skipping". The number of buffers hit exactly matches what you'll see on Postgres 17. It's just that we waste too many CPU cycles in code such as _bt_advance_array_keys, uselessly maintaining skip arrays.

I'm not suggesting that there won't be any gray area with these regressions -- nothing like this will ever be that simple. But it seems to me like I should go fix these obviously-not-okay cases next, and then see where that leaves everything else, regressions-wise. That seems likely to be the most efficient way of dealing with the regressions. So I'll start there.

That said, I *would* be surprised if you found a regression in any query that simply didn't receive any new scan key transformations in new preprocessing code in places like _bt_decide_skipatts and _bt_skip_preproc_shrink. I see that many of the queries you're using for your stress tests "aren't really testing skip scan", in this sense. But I'm hardly about to tell you that you shouldn't spend time on such queries -- that approach just discovered a bug affecting Postgres 17 (which was surprising, but it still happened!). My point is that it's worth being aware of which test queries actually use skip arrays in the first place -- it might help you with your testing. There are essentially no changes to _bt_advance_array_keys that'll affect traditional SAOP arrays (with the sole exception of the changes made by v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch, which affect every kind of array in the same way).

> 1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
>
> - I find the places that increment "nsearches" a bit random. Each AM
> does it in an entirely different place (at least it seems like that to
> me). Is there a way to make this a bit more consistent?

From a mechanical perspective there is nothing at all random about it: we do this at precisely the same point that we currently call pgstat_count_index_scan, which in each index AM maps to one descent of the index. It is at least consistent. Whenever a B-Tree index scan shows "Index Searches: N", you'll see precisely the same number by swapping it with an equivalent contrib/btree_gist-based GiST index and running the same query again (assuming that index tuples that match the array keys are spread apart in both the B-Tree and GiST indexes).

(Though I see problems with the precise place that nbtree calls pgstat_count_index_scan right now, at least in certain edge cases, which I discuss below in response to your questions about that.)
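Anyway, to make what the counter actually counts concrete, here's a hedged sketch of the kind of SAOP case I have in mind (the table, index, and values are all hypothetical):

    -- Each =ANY() element can require its own _bt_search descent when
    -- the matching tuples are spread far apart in the index.
    CREATE TABLE orders (customer_id bigint, order_date date);
    CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT order_date FROM orders
    WHERE customer_id = ANY (ARRAY[1, 5000, 9000]);
    -- With v6-0001 applied, expect "Index Searches: 3" (one descent per
    -- array element) when the three customers' tuples sit on different
    -- leaf pages -- and fewer searches when they share leaf pages.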
> uint64 btps_nsearches; /* instrumentation */
>
> Instrumentation of what? What's the counter for?

Will fix.

In case you missed it, there is another thread + CF entry dedicated to discussing this instrumentation patch:

https://commitfest.postgresql.org/49/5183/
https://www.postgresql.org/message-id/flat/cah2-wzkrqvaqr2ctnqtzp0z6ful4-3ed6eqb0yx38xbnj1v...@mail.gmail.com

> - I see _bt_first moved the pgstat_count_index_scan, but doesn't that
> mean we skip it if the earlier code does "goto readcomplete"? Shouldn't
> that still count as an index scan?

In my opinion, no, it should not. We're counting the number of times we'll have descended the tree using _bt_search (or using _bt_endpoint, perhaps), which is a precisely defined physical cost. A little like counting the number of buffers accessed.

I actually think that this aspect of how we call pgstat_count_index_scan is a bug that should be fixed, with the fix backpatched to Postgres 17. Right now, we see completely different counts for a parallel index scan, compared to an equivalent serial index scan -- differences that cannot be explained as minor differences caused by parallel scan implementation details. I think that it's just wrong right now, on master, since we're simply not counting the thing that we're supposed to be counting (not reliably, not if it's a parallel index scan).

> - show_indexscan_nsearches does this:
>
>     if (scanDesc && scanDesc->nsearches > 0)
>         ExplainPropertyUInteger("Index Searches", NULL,
>                                 scanDesc->nsearches, es);
>
> But shouldn't it divide the count by nloops, similar to (for example)
> show_instrumentation_count?

I can see arguments for and against doing it that way. It's ambiguous/subjective, but on balance I favor not dividing by nloops. You can make a similar argument for doing this with "Buffers: ", and yet we don't divide by nloops there, either. Honestly, I just want to find a way to do this that everybody can live with. Better documentation could help here.

> 2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
>
> - Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
> rather weird; I'd just call it "oppositeDirCheck".

Will fix.

> 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch
>
> - nothing

Great. I think that I should be able to commit this one soon, since it's independently useful work.

> 4) v6-0004-Add-skip-scan-to-nbtree.patch
>
> - indices.sgml seems to have the typo "Intevening" -> "Intervening"
>
> - It doesn't seem like a good idea to remove the paragraph about
> multicolumn indexes and replace it with just:
>
>     Multicolumn indexes should be used judiciously.
>
> I mean, what does judiciously even mean? What should the user consider
> to be judicious? Seems rather unclear to me. Admittedly, the old text
> was not much help, but at least it gave some advice.

Yeah, this definitely needs more work.

> But maybe more importantly, doesn't skipscan apply only to a rather
> limited subset of data types (those that support increment/decrement)?
> Doesn't the new wording mostly ignore that, implying skipscan applies
> to all btree indexes? I don't think it mentions data types anywhere,
> but there are many indexes on data types like text, UUID and so on.

Actually, no, skip scan works in almost the same way with all data types. Earlier versions of the patch didn't support every data type (perhaps I should have waited for that before posting my v1), but the version of the patch you looked at has no restrictions on any data type. You must be thinking of whether or not an opclass has skip support. That's just an extra optimization, which can be used for a small handful of discrete data types such as integer and date (it's hard to imagine how skip support could ever be implemented for types like numeric and text). There is a temporary testing GUC that will allow you to get a sense of how much skip support can help: try "set skipscan_skipsupport_enabled=off" with (say) my original MDAM test query to get a sense of that. You'll see more buffer hits needed for "next key probes", though not dramatically more. It's worth having skip support (the idea comes from the MDAM paper), but it's not essential. Whether or not an opclass has skip support isn't accounted for by the cost model, but I doubt that it's worth addressing (the cost model is already pessimistic).
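If you want to see that effect for yourself, here's a hedged sketch of the kind of comparison I mean (only the GUC name comes from the patch; the table and query are hypothetical):

    CREATE TABLE sales (region_id int, sale_date date, amount numeric);
    CREATE INDEX sales_region_date_idx ON sales (region_id, sale_date);

    SET skipscan_skipsupport_enabled = off;  -- temporary testing GUC
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM sales WHERE sale_date = '2024-09-01';  -- no region_id qual
    -- Expect somewhat more buffer hits than with the GUC left on:
    -- without skip support, each next region_id must be located with an
    -- extra "next key probe", rather than derived by incrementing the
    -- current value.
    RESET skipscan_skipsupport_enabled;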
> - Very subjective nitpicking, but I find it a bit strange when a comment
> about a block is nested in the block, like in _bt_first() for the
> array->null_elem check.

Will fix.

> - assignProcTypes() claims providing skipscan for cross-type scenarios
> doesn't make sense. Why is that? I'm not saying the claim is wrong, but
> it's not clear to me why that would be the case.

It is just talking about the support function that skip scan can optionally use, where it makes sense (skip support functions). The relevant "else if (member->number == BTSKIPSUPPORT_PROC)" stanza is largely copied from the existing nearby "else if (member->number == BTEQUALIMAGE_PROC)" stanza that was added for B-Tree deduplication. In both stanzas we're talking about a capability that maps to a particular "input opclass", which means the opclass that maps to the datums that are stored on disk, in index tuples.

There are no restrictions on the use of skip scan with queries that happen to involve the use of cross-type operators. It doesn't even matter if we happen to be using an incomplete opfamily, since range skip arrays never need to *directly* take the current array element from a lower/upper bound inequality scan key's argument. It all happens indirectly: code in places like _bt_first and _bt_checkkeys can use inequalities (which are stored in BTArrayKeyInfo.low_compare and BTArrayKeyInfo.high_compare) to locate the next matching on-disk index tuple that satisfies the inequality in question. Obviously, the located datum must be the same type as the one used by the array and its scan key (it has to be the input opclass type if it's taken from an index tuple).

I think that it's a bit silly that nbtree generally bends over backwards to find a way to execute a scan, given an incomplete opfamily; in a green-field situation it would make sense to just throw an error instead. Even so, skip scan works in a way that is maximally forgiving when incomplete opfamilies are used. Admittedly, it is just about possible to come up with a scenario where we'll now throw an error for a query that would have worked on Postgres 17. But that's no different to what would happen if the query had an explicit "= any( )" non-cross-type array instead of an implicit non-cross-type skip array. The real problem in these scenarios is the lack of a suitable cross-type ORDER proc (for a cross-type-operator query) within _bt_first -- not the lack of cross-type operators. This issue with missing ORDER procs just doesn't seem worth worrying about, since, as I said, even slightly different queries (that don't use skip scan) are bound to throw the same errors either way.
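To make the indirection concrete, here's a hedged sketch (hypothetical table and predicate) of a query that I'd expect to get a range skip array:

    -- The BETWEEN bounds become ordinary inequality scan keys (stored in
    -- low_compare/high_compare); they constrain which priority values the
    -- skip array steps through, without ever being copied into the array
    -- as elements themselves.
    CREATE TABLE events (priority int, ts timestamptz);
    CREATE INDEX events_priority_ts_idx ON events (priority, ts);

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM events
    WHERE priority BETWEEN 3 AND 7 AND ts = '2024-09-16 00:00:00+00';
    -- Expect one "ts = ..." search per distinct priority value in [3, 7]
    -- that actually appears in the index.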
> Peter asked me to look at the costing, and I think it looks generally
> sensible.

I'm glad that you think that I basically have the right idea here. It's hard to know how to approach something like this, which doesn't have any kind of precedent to draw on.

> We don't really have a lot of information to base the costing
> on in the first place - the whole point of skipscan is about multicolumn
> indexes, but none of the existing extended statistics seem very useful.
> We'd need some cross-column correlation info, or something like that.

Maybe, but that would just mean that we'd sometimes be more optimistic about skip scan helping than we are with the current approach of pessimistically assuming that there is no correlation at all. It's not clear that being pessimistic in this sense is the wrong thing to do, despite the fact that it's clearly less accurate on average.

> There's one thing that I don't quite understand, and that's how
> btcost_correlation() adjusts correlation for multicolumn indexes:
>
>     if (index->nkeycolumns > 1)
>         indexCorrelation = varCorrelation * 0.75;
>
> That seems fine for a two-column index, I guess. But shouldn't it
> compound for indexes with more keys? I mean, 0.75 * 0.75 for the third
> column, etc? I don't think btcostestimate() does that; it just remembers
> whatever btcost_correlation() returns.

I don't know either. In general I'm out of my comfort zone here.

> The only alternative approach I can think of is not to adjust the
> costing for the index scan at all, and only use this to enable (or not
> enable) the skipscan internally. That would mean the overall plan
> remains the same, and maybe sometimes we would think an index scan would
> be too expensive and use something else. Not great, but it doesn't have
> the risk of regressions - IIUC we can disable the skipscan at runtime,
> if we realize it's not really helpful.

In general I would greatly prefer not to have a distinct kind of index path for scans that use skip scan. I'm quite keen on a design that allows the scan to adapt to unpredictable conditions at runtime.

Of course, that doesn't preclude passing the index scan a hint about what's likely to work at runtime, based on information figured out when costing the scan. Perhaps that will prove necessary to avoid regressing index scans that are naturally quite cheap already -- scans where we really need to have the right general idea from the start to avoid any regressions. I'm not opposed to that, provided the index scan has the ability to change its mind when (for whatever reason) the guidance from the optimizer turns out to be wrong.

> As usual, I wrote a bash script to do a bit of stress testing. It
> generates tables with random data, and then runs random queries with
> random predicates on them, while mutating a couple of parameters (like
> the number of workers) to trigger different plans. It does that on 16,
> master, and with the skipscan patch (with the fix for parallel scans).

I wonder if some of the regressions you see can be tied to the use of an LWLock in place of the existing use of a spinlock. I did that because I sometimes need to allocate memory to deserialize the array keys, with the exclusive lock held. It might be the case that a lot of these regressions are tied to that, or something else that is far from obvious...I'll have to investigate.

In general, I haven't done much on parallel index scans here (I only added support for them very recently), whereas your testing places a lot of emphasis on parallel scans. Nothing wrong with that emphasis (it caught that 17 bug), but I just want to put it in context.
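Back on the costing for a moment, here's a hedged back-of-the-envelope sketch of what "pessimistic" amounts to, using the shape of your test tables (1M rows, 1000 distinct values per column; my rough arithmetic, not the cost model's exact formula):

    -- With zero assumed correlation between id1 and id2, a skip scan for
    -- "WHERE id2 = :value" is expected to visit every distinct id1 group:
    SELECT 1000           AS expected_index_searches,  -- ~one per distinct id1
           1000000 / 1000 AS expected_matching_rows;   -- ~1000 rows in total
    -- Which lines up with the "Index Searches: 1007" in the plan that
    -- you show below.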
> I've uploaded the script and results from the last run here:
>
> https://github.com/tvondra/pg-skip-scan-tests
>
> There's the "run-mdam.sh" script that generates tables/queries, runs
> them, collects all kinds of info about the query, and produces files
> with explain plans, a CSV with timings, etc.

It'll take me a while to investigate all this data.

> Anyway, I ran a couple thousand such queries, and I haven't found any
> incorrect results (the script compares them between versions too). So
> that's good ;-)

That's good!

> But my main goal was to see how this affects performance. The tables
> were pretty small (just 1M rows, maybe ~80MB), but with restarts and
> dropping caches, large enough to test this.

The really compelling cases all tend to involve fairly selective index scans. Obviously, skip scan can only save work by navigating the index structure more efficiently (unlike loose index scan). So if the main cost is inherently bound to be the cost of heap accesses, then we shouldn't expect a big speedup.

> For example, one of the slowed-down queries is query 702 (top of page 8
> in the PDF). The query is pretty simple:
>
>     explain (analyze, timing off, buffers off)
>     select id1,id2 from t_1000000_1000_1_2
>     where NOT (id1 in (:list)) AND (id2 = :value);
>
> and it was executed on a table with random data in two columns, each
> with 1000 distinct values. This is perfectly random data, so a great
> match for the assumptions in costing etc.
>
> But with uncached data, this runs in ~50 ms on master, but takes almost
> 200 ms with skipscan (these timings are from my laptop, but similar to
> the results).

I'll need to investigate this specifically. That does seem odd.

FWIW, it's a pity that the patch doesn't know how to push down the NOT IN () here. The MDAM paper contemplates such a scheme. We see the use of filter quals here, when in principle this could work by using a skip array that doesn't generate elements that appear in the NOT IN() list (it'd generate every possible indexable value *except* the given list/array values). The only reason that I haven't implemented this yet is that I'm not at all sure how to make it work on the optimizer side. The nbtree side of the implementation will probably be quite straightforward, since it's really just a slight variant of a skip array that excludes certain values.

> -- with skipscan
>  Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
>      t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
>      (actual rows=811 loops=1)
>    Index Cond: (id2 = 997)
>    Index Searches: 1007
>    Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
>    Rows Removed by Filter: 163
>    Heap Fetches: 0
>  Planning Time: 3.730 ms
>  Execution Time: 238.554 ms
> (8 rows)
>
> I haven't looked into why this is happening, but this seems like a
> pretty good match for skipscan (on the first column). And for the
> costing too - it's perfectly random data, no correlation, etc.

I wonder what "Buffers: N" shows? That's usually the first thing I look at (that and "Index Searches", which looks like what you said it should look like here). But, yeah, let me get back to you on this.

Thanks again!

--
Peter Geoghegan