I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7.
Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available.

The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages.

SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible.

Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal, even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense.

This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) can have up to 5.5x higher transaction throughput with the patch under concurrent load. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case.

2. The optimizer now produces index paths with multiple SAOP clauses (or other clauses we can safely treat as "equality constraints") on each of the leading columns from a composite index -- all while preserving index ordering/useful pathkeys in most cases.

The nbtree work from item 1 is useful even with the simplest IN() list query involving a scan of a single-column index. Obviously, it's very inefficient for the nbtree code to use 100 primitive index scans when 1 is sufficient. But that's not really why I'm pursuing this project. My real goal is to implement (or to enable the implementation of) a whole family of useful techniques for multi-column indexes. I call these "MDAM techniques", after the 1995 paper "Efficient Search of Multidimensional B-Trees" [1][2] -- MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system.

The most compelling cases for the patch all involve multiple index columns with multiple SAOP clauses (especially where each column represents a separate "dimension", in the DSS sense). It's important that index sort order be preserved whenever possible, too.
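To make the target workload concrete, here's a minimal sketch of the kind of DSS-style case I have in mind. The table, index, and constants are hypothetical -- invented purely for illustration, not taken from the patch or its tests:

create index fact_sales_dims_idx on fact_sales (region, product, month);

select region, product, month, sum(amount)
from fact_sales
where region in ('EMEA', 'APAC')              -- dimension 1
  and product in (1, 2, 3)                    -- dimension 2
  and month in ('2023-01-01', '2023-02-01')   -- dimension 3
group by region, product, month
order by region, product, month;

The idea is that a single index scan should be able to satisfy all three IN() lists while returning rows in index order, so the ORDER BY (and the grouping) needn't require a separate sort.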
Sometimes this is directly useful (e.g., because the query has an ORDER BY), but it's always indirectly needed on the nbtree side (when the optimizations are applicable at all). The new nbtree code has special requirements surrounding SAOP search-type scan keys with composite indexes. These requirements make changes in the optimizer all but essential.

Index order
===========

As I said, there are cases where preserving index order is immediately and obviously useful, in and of itself. Let's start there. Here's a test case that you can run against the regression test database:

pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty);
CREATE INDEX
pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand
from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty
limit 20;

With the patch, this query gets 13 buffer hits. On the master branch, it gets 1377 buffer hits -- which exceeds the number you'll get from a sequential scan by about 4x. No coaxing was required to get the planner to produce this plan on the master branch.

Almost all of the savings shown here are related to heap page buffer hits -- the nbtree changes don't directly help in this particular example (strictly speaking, you only need the optimizer changes to get this result). Obviously, the immediate reason why the patch wins by so much is because it produces a plan that allows the LIMIT to terminate the scan far sooner.

Benoit Tigeot (CC'd) happened to run into this issue organically -- that was also due to heap hits, a LIMIT, and so on. As luck would have it, I stumbled upon his problem report (in the Postgres slack channel) while I was working on this patch. He produced a fairly complete test case, which was helpful [3]. This example is more or less just a distillation of his test case, designed to be easy for a Postgres hacker to try out for themselves.

There are also variants of this query where a LIMIT isn't the crucial factor, and where index page hits are the problem. This query uses an index-only scan, both on master and with the patch (same index as before):

select count(*), two, four, twenty
from tenk1
where two in (0, 1)
  and four in (1, 2, 3, 4)
  and twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
group by two, four, twenty
order by two, four, twenty;

The patch gets 18 buffer hits for this query. That outcome makes intuitive sense, since this query is highly unselective -- it's approaching the selectivity of the query "select count(*) from tenk1". The simple count(*) query gets 19 buffer hits for its own index-only scan, confirming that the patch managed to skip only one or two leaf pages in the complicated "group by" variant of the count(*) query. Overall, the GroupAggregate plan used by the patch is slower than the simple count(*) case (despite touching fewer pages). But both plans have *approximately* the same execution cost, which makes sense, since they both have very similar selectivities.

The master branch gets 245 buffer hits for the same group by query. This is almost as many hits as a sequential scan would require -- even though there are precisely zero heap accesses needed by the underlying index-only scan. As with the first example, no planner coaxing was required to get this outcome on master.

It is inherently very difficult to predict how selective a query like this will be using conventional statistics. But that's not actually the problem in this example -- the planner gets that part right, on this occasion. The real problem is that there is a multiplicative factor to worry about on master when executing multiple SAOPs, which makes it almost impossible to predict the number of pages we'll pin.
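A back-of-the-envelope calculation of my own (illustrative reasoning, not a measured breakdown) shows where that multiplicative factor comes from in the "group by" example: on master, the scan performs one primitive index scan per combination of array keys, and each descent touches the root page plus a leaf page of this small two-level index:

    2 (two) * 4 (four) * 15 (twenty) = 120 combinations of array keys
    120 primitive index scans * ~2 index page hits per descent = ~240 index page hits

which is roughly in line with the 245 buffer hits actually observed.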
With the patch, by contrast, scans with multiple SAOPs are often fairly similar to scans that happen to just have one, on the leading column. With the patch, it is simply impossible for an SAOP index scan to visit any single leaf page more than once -- just like a conventional index scan. Whereas right now, on master, using more than one SAOP clause with a multi-column index seems to me to be a wildly risky proposition. You can easily have cases that work just fine on master, while only slight variations of the same query see costs explode (especially likely with a LIMIT).

ISTM that there is significant value in the planner having a pretty accurate idea of the worst case. Giving nbtree the ability to skip or not skip dynamically, based on actual conditions in the index (not on statistics), seems like it has a lot of potential as a way of improving performance *stability*. Personally I'm most interested in this aspect of the project.

Note: we can visit internal pages more than once, but that seems to make a negligible difference to the overall cost profile of scans. Our policy is to not charge an I/O cost for those pages. Plus, the number of internal page accesses is dramatically reduced (it's just not guaranteed that there won't be any repeat accesses for internal pages, is all).

Note also: there are hard-to-pin-down interactions between the immediate problem on the nbtree side and the use of filter quals rather than true index quals, where the use of index quals is possible in principle. Some problematic cases see excessive amounts of heap page hits only (as with my first example query). Other problematic cases see excessive amounts of index page hits, with little to no impact on heap page hits at all (as with my second example query). Some combination of the two is also possible.

Safety
======

As mentioned already, the ability to "advance the current set of array keys locally" during a scan (the nbtree work in item 1) actually relies on the optimizer work in item 2 -- it's not just a question of unlocking the potential of the nbtree work. Now I'll discuss those aspects in a bit more detail.

Without the optimizer work, nbtree will produce wrong answers to queries, in a way that resembles the complaint addressed by historical bugfix commit 807a40c5. This incorrect behavior (if the optimizer were to permit it) would only be seen when there are multiple arrays/columns, and an inequality on a leading column -- just like with that historical bug. (It works both ways, though -- the nbtree changes also make the optimizer changes safe by limiting the worst case, which would otherwise be too much of a risk to countenance. You can't separate one from the other.)

The primary change on the optimizer side is the addition of logic to differentiate between the following two cases when building an index path in indxpath.c (an illustrative pair of queries follows the list):

* Unsafe: Cases where it's fundamentally unsafe to treat multi-column-with-SAOP-clause index paths as returning tuples in a useful sort order. For example, the test case committed as part of that bugfix involves an inequality, so it continues to be treated as unsafe.

* Safe: Cases where (at least in theory) bugfix commit 807a40c5 went further than it really had to. Those cases get to use the optimization, and usually get to have useful pathkeys.
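As an illustration of the distinction (my own sketch, written with the regression database's existing index on (thousand, tenthous) in mind -- I'm not quoting the committed test case verbatim):

-- Unsafe: inequality on the leading column, SAOP on a later column.
-- Each primitive scan covers the whole "thousand < 2" range for one
-- "tenthous" array element, so rows come back grouped by array key
-- rather than in (thousand, tenthous) order. The path must not claim
-- useful pathkeys here.
select thousand, tenthous from tenk1
where thousand < 2 and tenthous in (1001, 3000)
order by thousand, tenthous;

-- Safe: equality-strategy SAOPs on each column used. The array keys
-- can be advanced in keyspace order, so it's okay for the path to
-- claim that the scan returns tuples in (thousand, tenthous) order.
select thousand, tenthous from tenk1
where thousand in (0, 1) and tenthous in (1001, 3000)
order by thousand, tenthous;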
My optimizer changes are very kludgey. I came up with various ad-hoc rules to distinguish between the safe and unsafe cases, without ever really placing those changes into some kind of larger framework. That was enough to validate the general approach in nbtree, but it certainly has problems -- glaring problems.

The biggest problem of all may be my whole "safe vs unsafe" framing itself. I know that many of the ostensibly unsafe cases are in fact safe (with the right infrastructure in place), because the MDAM paper says just that. The optimizer can't support inequalities right now, but the paper describes how to support "NOT IN()" lists -- clearly an inequality! The current ad-hoc rules are at best incomplete, and at worst are addressing the problem in fundamentally the wrong way.

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs") and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans that can be processed in keyspace order.

We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM-style DNF single value predicates. The state machine works by outputting the cartesian product of the arrays as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3".

In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than for the optimizer, in any case.

But that's not entirely true, either (that would be far too easy). The optimizer still needs to optimize. It can't very well do that without having some kind of advance notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. I suspect that there needs to be a two-way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here.
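To make the CNF -> DNF idea concrete before moving on, here is the simplest case that the array-keys state machine already handles in effect (a purely illustrative qual, with made-up column names):

-- CNF: an "AND of ORs" (each IN() list is effectively an OR)
where a in (1, 2) and b in (10, 20)

-- Logically equivalent DNF: an "OR of ANDs", each disjunct built from
-- single value predicates. These disjuncts are the cartesian product
-- of the arrays, which the _bt_advance_array_keys state machine
-- already steps through in keyspace order, one set of array keys at
-- a time:
   (a = 1 and b = 10)
or (a = 1 and b = 20)
or (a = 2 and b = 10)
or (a = 2 and b = 20)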
Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" -- terminology with a great deal of baggage. These days I prefer to call it "filling in missing key predicates", per the paper. That's much more descriptive, and makes it less likely that people will conflate the technique with InnoDB-style "loose index scans" -- the latter is a much more specialized/targeted optimization. (I now believe that these are very different things, though I was thrown off by the superficial similarities for a long time. It's pretty confusing.)

I see this work as a key enabler of "filling in missing key predicates". MDAM describes how to implement this technique by applying the same principles that it applies everywhere else: it proposes a scheme that converts predicates from CNF to DNF, with just a little extra logic required to do index probes to feed the DNF-generating state machine, on demand.

More concretely, in Postgres terms: skip scan can be implemented by inventing a new placeholder clause that can be composed alongside ScalarArrayOpExprs, in the same way that multiple ScalarArrayOpExprs can be composed together in the patch already. I'm thinking of a type of clause that makes the nbtree code materialize a set of "array keys" for a SK_SEARCHARRAY scan key dynamically, via ad-hoc index probes (perhaps static approaches would be better for types like boolean, which the paper contemplates). It should be possible to teach the _bt_advance_array_keys state machine to generate those values in approximately the same fashion as it already does for ScalarArrayOpExprs -- and, it shouldn't be too hard to do it in a localized fashion, allowing everything else to continue to work in the same way without any special concern.

This separation of concerns is a nice consequence of the way that the MDAM design really leverages preprocessing/DNF for everything. Both types of clauses can be treated as part of a general class of ScalarArrayOpExpr-like clauses. Making the rules around "composability" simple will be important.

Although skip scan gets a lot of attention, it's not necessarily the most compelling MDAM technique. It's also not especially challenging to implement on top of everything else. It really isn't that special. Right now I'm focussed on the big picture, in any case.

I want to emphasize the very general nature of these techniques. Although I'm focussed on SAOPs in the short term, many queries that don't make use of SAOPs should ultimately see similar benefits. For example, the paper also describes transformations that apply to BETWEEN/range predicates. We might end up needing a third type of expression for those. They're all just DNF single value predicates, under the hood.
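To sketch what "filling in missing key predicates" would look like from the outside (hypothetical table, index, and query -- none of this exists in the patch today):

create index on employees (dept, salary);

-- No predicate on the leading column:
select * from employees where salary = 50000;

-- Filling in the missing key predicate would let nbtree behave as if
-- the query had been written with an extra clause on "dept", whose
-- "array keys" are discovered on the fly via ad-hoc index probes
-- rather than supplied by the user:
select * from employees
where dept in (/* every distinct dept, found by probing the index */)
  and salary = 50000;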
Thoughts?

[1] http://vldb.org/conf/1995/P710.PDF
[2] https://www.postgresql.org/message-id/2587523.1647982...@sss.pgh.pa.us
[3] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d
--
Peter Geoghegan

v1-0001-Enhance-nbtree-ScalarArrayOp-execution.patch