I've been working on a variety of improvements to nbtree's native ScalarArrayOpExpr execution. This builds on Tom's work in commit 9e8da0f7.
Attached patch is still at the prototype stage. I'm posting it as v1 a little earlier than I usually would because there has been much back and forth about it on a couple of other threads involving Tomas Vondra and Jeff Davis -- seems like it would be easier to discuss with working code available.

The patch adds two closely related enhancements to ScalarArrayOp execution by nbtree:

1. Execution of quals with ScalarArrayOpExpr clauses during nbtree index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now "advance the scan's array keys locally", which sometimes avoids significant amounts of unneeded pinning/locking of the same set of index pages.

SAOP index scans become capable of eliding primitive index scans for the next set of array keys in line in cases where it isn't truly necessary to descend the B-Tree again. Index scans are now capable of "sticking with the existing leaf page for now" when it is determined that the end of the current set of array keys is physically close to the start of the next set of array keys (the next set in line to be materialized by the _bt_advance_array_keys state machine). This is often possible.

Naturally, we still prefer to advance the array keys in the traditional way ("globally") much of the time. That means we'll perform another _bt_first/_bt_search descent of the index, starting a new primitive index scan. Whether we try to skip pages on the leaf level or stick with the current primitive index scan (by advancing array keys locally) is likely to vary a great deal, even during the same index scan. Everything is decided dynamically, which is the only approach that really makes sense.

This optimization can significantly lower the number of buffers pinned and locked in cases with significant locality, and/or with many array keys with no matches. The savings (when measured in buffers pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking has shown that variants of "pgbench -S" designed to stress the implementation (hundreds of array constants) can have up to 5.5x higher transaction throughput with the patch under concurrent load. Less extreme cases (10 array constants, spaced apart) see about a 20% improvement in throughput. There are similar improvements to latency for the patch, in each case.

2. The optimizer now produces index paths with multiple SAOP clauses (or other clauses we can safely treat as "equality constraints") on each of the leading columns from a composite index -- all while preserving index ordering/useful pathkeys in most cases.

The nbtree work from item 1 is useful even with the simplest IN() list query involving a scan of a single-column index. Obviously, it's very inefficient for the nbtree code to use 100 primitive index scans when 1 is sufficient. But that's not really why I'm pursuing this project. My real goal is to implement (or to enable the implementation of) a whole family of useful techniques for multi-column indexes. I call these "MDAM techniques", after the 1995 paper "Efficient Search of Multidimensional B-Trees" [1][2] -- MDAM is short for "multidimensional access method". In the context of the paper, "dimension" refers to dimensions in a decision support system.

The most compelling cases for the patch all involve multiple index columns with multiple SAOP clauses (especially where each column represents a separate "dimension", in the DSS sense). It's important that index sort order be preserved whenever possible, too.
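To make the target workload concrete, here's a minimal sketch of the kind of DSS-style case I have in mind. The table, index, and constants are hypothetical -- invented purely for illustration, not taken from the patch or its tests:

create index fact_sales_dims_idx on fact_sales (region, product, month);

select region, product, month, sum(amount)
from fact_sales
where region in ('EMEA', 'APAC')              -- dimension 1
  and product in (1, 2, 3)                    -- dimension 2
  and month in ('2023-01-01', '2023-02-01')   -- dimension 3
group by region, product, month
order by region, product, month;

The idea is that a single index scan should be able to satisfy all three IN() lists while returning rows in index order, so the ORDER BY (and the grouping) needn't require a separate sort.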
Sometimes this is directly useful (e.g., because the query has an ORDER BY), but it's always indirectly needed on the nbtree side (when the optimizations are applicable at all). The new nbtree code has special requirements surrounding SAOP search-type scan keys with composite indexes. These requirements make changes in the optimizer all but essential.

Index order
===========

As I said, there are cases where preserving index order is immediately and obviously useful, in and of itself. Let's start there. Here's a test case that you can run against the regression test database:

pg@regression:5432 =# create index order_by_saop on tenk1(two,four,twenty);
CREATE INDEX
pg@regression:5432 =# EXPLAIN (ANALYZE, BUFFERS)
select ctid, thousand
from tenk1
where two in (0,1) and four in (1,2) and twenty in (1,2)
order by two, four, twenty
limit 20;

With the patch, this query gets 13 buffer hits. On the master branch, it gets 1377 buffer hits -- which exceeds the number you'll get from a sequential scan by about 4x. No coaxing was required to get the planner to produce this plan on the master branch.

Almost all of the savings shown here are related to heap page buffer hits -- the nbtree changes don't directly help in this particular example (strictly speaking, you only need the optimizer changes to get this result). Obviously, the immediate reason why the patch wins by so much is because it produces a plan that allows the LIMIT to terminate the scan far sooner.

Benoit Tigeot (CC'd) happened to run into this issue organically -- that was also due to heap hits, a LIMIT, and so on. As luck would have it, I stumbled upon his problem report (in the Postgres slack channel) while I was working on this patch. He produced a fairly complete test case, which was helpful [3]. This example is more or less just a distillation of his test case, designed to be easy for a Postgres hacker to try out for themselves.

There are also variants of this query where a LIMIT isn't the crucial factor, and where index page hits are the problem. This query uses an index-only scan, both on master and with the patch (same index as before):

select count(*), two, four, twenty
from tenk1
where two in (0, 1)
  and four in (1, 2, 3, 4)
  and twenty in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
group by two, four, twenty
order by two, four, twenty;

The patch gets 18 buffer hits for this query. That outcome makes intuitive sense, since this query is highly unselective -- it's approaching the selectivity of the query "select count(*) from tenk1". The simple count(*) query gets 19 buffer hits for its own index-only scan, confirming that the patch managed to skip only one or two leaf pages in the complicated "group by" variant of the count(*) query. Overall, the GroupAggregate plan used by the patch is slower than the simple count(*) case (despite touching fewer pages). But both plans have *approximately* the same execution cost, which makes sense, since they both have very similar selectivities.

The master branch gets 245 buffer hits for the same group by query. This is almost as many hits as a sequential scan would require -- even though there are precisely zero heap accesses needed by the underlying index-only scan. As with the first example, no planner coaxing was required to get this outcome on master.

It is inherently very difficult to predict how selective a query like this will be using conventional statistics. But that's not actually the problem in this example -- the planner gets that part right, on this occasion. The real problem is that there is a multiplicative factor to worry about on master when executing multiple SAOPs, which makes it almost impossible to predict the number of pages we'll pin.
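A back-of-the-envelope calculation of my own (illustrative reasoning, not a measured breakdown) shows where that multiplicative factor comes from in the "group by" example: on master, the scan performs one primitive index scan per combination of array keys, and each descent touches the root page plus a leaf page of this small two-level index:

    2 (two) * 4 (four) * 15 (twenty) = 120 combinations of array keys
    120 primitive index scans * ~2 index page hits per descent = ~240 index page hits

which is roughly in line with the 245 buffer hits actually observed.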
With the patch, by contrast, scans with multiple SAOPs are often fairly similar to scans that happen to just have one, on the leading column. With the patch, it is simply impossible for an SAOP index scan to visit any single leaf page more than once -- just like a conventional index scan. Whereas right now, on master, using more than one SAOP clause with a multi-column index seems to me to be a wildly risky proposition. You can easily have cases that work just fine on master, while only slight variations of the same query see costs explode (especially likely with a LIMIT).

ISTM that there is significant value in the planner having a pretty accurate idea of the worst case. Giving nbtree the ability to skip or not skip dynamically, based on actual conditions in the index (not on statistics), seems like it has a lot of potential as a way of improving performance *stability*. Personally I'm most interested in this aspect of the project.

Note: we can visit internal pages more than once, but that seems to make a negligible difference to the overall cost profile of scans. Our policy is to not charge an I/O cost for those pages. Plus, the number of internal page accesses is dramatically reduced (it's just not guaranteed that there won't be any repeat accesses for internal pages, is all).

Note also: there are hard-to-pin-down interactions between the immediate problem on the nbtree side and the use of filter quals rather than true index quals, where the use of index quals is possible in principle. Some problematic cases see excessive amounts of heap page hits only (as with my first example query). Other problematic cases see excessive amounts of index page hits, with little to no impact on heap page hits at all (as with my second example query). Some combination of the two is also possible.

Safety
======

As mentioned already, the ability to "advance the current set of array keys locally" during a scan (the nbtree work in item 1) actually relies on the optimizer work in item 2 -- it's not just a question of unlocking the potential of the nbtree work. Now I'll discuss those aspects in a bit more detail.

Without the optimizer work, nbtree will produce wrong answers to queries, in a way that resembles the complaint addressed by historical bugfix commit 807a40c5. This incorrect behavior (if the optimizer were to permit it) would only be seen when there are multiple arrays/columns, and an inequality on a leading column -- just like with that historical bug. (It works both ways, though -- the nbtree changes also make the optimizer changes safe by limiting the worst case, which would otherwise be too much of a risk to countenance. You can't separate one from the other.)

The primary change on the optimizer side is the addition of logic to differentiate between the following two cases when building an index path in indxpath.c (an illustrative pair of queries follows the list):

* Unsafe: Cases where it's fundamentally unsafe to treat multi-column-with-SAOP-clause index paths as returning tuples in a useful sort order. For example, the test case committed as part of that bugfix involves an inequality, so it continues to be treated as unsafe.

* Safe: Cases where (at least in theory) bugfix commit 807a40c5 went further than it really had to. Those cases get to use the optimization, and usually get to have useful pathkeys.
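As an illustration of the distinction (my own sketch, written with the regression database's existing index on (thousand, tenthous) in mind -- I'm not quoting the committed test case verbatim):

-- Unsafe: inequality on the leading column, SAOP on a later column.
-- Each primitive scan covers the whole "thousand < 2" range for one
-- "tenthous" array element, so rows come back grouped by array key
-- rather than in (thousand, tenthous) order. The path must not claim
-- useful pathkeys here.
select thousand, tenthous from tenk1
where thousand < 2 and tenthous in (1001, 3000)
order by thousand, tenthous;

-- Safe: equality-strategy SAOPs on each column used. The array keys
-- can be advanced in keyspace order, so it's okay for the path to
-- claim that the scan returns tuples in (thousand, tenthous) order.
select thousand, tenthous from tenk1
where thousand in (0, 1) and tenthous in (1001, 3000)
order by thousand, tenthous;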
My optimizer changes are very kludgey. I came up with various ad-hoc rules to distinguish between the safe and unsafe cases, without ever really placing those changes into some kind of larger framework. That was enough to validate the general approach in nbtree, but it certainly has problems -- glaring problems.

The biggest problem of all may be my whole "safe vs unsafe" framing itself. I know that many of the ostensibly unsafe cases are in fact safe (with the right infrastructure in place), because the MDAM paper says just that. The optimizer can't support inequalities right now, but the paper describes how to support "NOT IN()" lists -- clearly an inequality! The current ad-hoc rules are at best incomplete, and at worst are addressing the problem in fundamentally the wrong way.

CNF -> DNF conversion
=====================

Like many great papers, the MDAM paper takes one core idea, and finds ways to leverage it to the hilt. Here the core idea is to take predicates in conjunctive normal form (an "AND of ORs") and convert them into disjunctive normal form (an "OR of ANDs"). DNF quals are logically equivalent to CNF quals, but ideally suited to SAOP-array style processing by an ordered B-Tree index scan -- they reduce everything to a series of non-overlapping primitive index scans that can be processed in keyspace order.

We already do this today in the case of SAOPs, in effect. The nbtree "next array keys" state machine already materializes values that can be seen as MDAM-style DNF single value predicates. The state machine works by outputting the cartesian product of the arrays as a multi-column index is scanned, but that could be taken a lot further in the future. We can use essentially the same kind of state machine to do everything described in the paper -- ultimately, it just needs to output a list of disjuncts, like the DNF clauses that the paper shows in "Table 3".

In theory, anything can be supported via a sufficiently complete CNF -> DNF conversion framework. There will likely always be the potential for unsafe/unsupported clauses and/or types in an extensible system like Postgres, though. So we will probably need to retain some notion of safety. It seems like more of a job for nbtree preprocessing (or some suitably index-AM-agnostic version of the same idea) than for the optimizer, in any case.

But that's not entirely true, either (that would be far too easy). The optimizer still needs to optimize. It can't very well do that without having some kind of advance notice of what is and is not supported by the index AM. And, the index AM cannot just unilaterally decide that index quals actually should be treated as filter/qpquals, after all -- it doesn't get a veto. So there is a mutual dependency that needs to be resolved. I suspect that there needs to be a two-way conversation between the optimizer and nbtree code to break the dependency -- a callback that does some of the preprocessing work during planning. Tom said something along the same lines in passing, when discussing the MDAM paper last year [2]. Much work remains here.
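To make the CNF -> DNF idea concrete before moving on, here is the simplest case that the array-keys state machine already handles in effect (a purely illustrative qual, with made-up column names):

-- CNF: an "AND of ORs" (each IN() list is effectively an OR)
where a in (1, 2) and b in (10, 20)

-- Logically equivalent DNF: an "OR of ANDs", each disjunct built from
-- single value predicates. These disjuncts are the cartesian product
-- of the arrays, which the _bt_advance_array_keys state machine
-- already steps through in keyspace order, one set of array keys at
-- a time:
   (a = 1 and b = 10)
or (a = 1 and b = 20)
or (a = 2 and b = 10)
or (a = 2 and b = 20)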
Skip Scan
=========

MDAM encompasses something that people tend to call "skip scan" -- terminology with a great deal of baggage. These days I prefer to call it "filling in missing key predicates", per the paper. That's much more descriptive, and makes it less likely that people will conflate the technique with InnoDB-style "loose index scans" -- the latter is a much more specialized/targeted optimization. (I now believe that these are very different things, though I was thrown off by the superficial similarities for a long time. It's pretty confusing.)

I see this work as a key enabler of "filling in missing key predicates". MDAM describes how to implement this technique by applying the same principles that it applies everywhere else: it proposes a scheme that converts predicates from CNF to DNF, with just a little extra logic required to do index probes to feed the DNF-generating state machine, on demand.

More concretely, in Postgres terms: skip scan can be implemented by inventing a new placeholder clause that can be composed alongside ScalarArrayOpExprs, in the same way that multiple ScalarArrayOpExprs can be composed together in the patch already. I'm thinking of a type of clause that makes the nbtree code materialize a set of "array keys" for a SK_SEARCHARRAY scan key dynamically, via ad-hoc index probes (perhaps static approaches would be better for types like boolean, which the paper contemplates). It should be possible to teach the _bt_advance_array_keys state machine to generate those values in approximately the same fashion as it already does for ScalarArrayOpExprs -- and, it shouldn't be too hard to do it in a localized fashion, allowing everything else to continue to work in the same way without any special concern.

This separation of concerns is a nice consequence of the way that the MDAM design really leverages preprocessing/DNF for everything. Both types of clauses can be treated as part of a general class of ScalarArrayOpExpr-like clauses. Making the rules around "composability" simple will be important.

Although skip scan gets a lot of attention, it's not necessarily the most compelling MDAM technique. It's also not especially challenging to implement on top of everything else. It really isn't that special. Right now I'm focussed on the big picture, in any case.

I want to emphasize the very general nature of these techniques. Although I'm focussed on SAOPs in the short term, many queries that don't make use of SAOPs should ultimately see similar benefits. For example, the paper also describes transformations that apply to BETWEEN/range predicates. We might end up needing a third type of expression for those. They're all just DNF single value predicates, under the hood.
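To sketch what "filling in missing key predicates" would look like from the outside (hypothetical table, index, and query -- none of this exists in the patch today):

create index on employees (dept, salary);

-- No predicate on the leading column:
select * from employees where salary = 50000;

-- Filling in the missing key predicate would let nbtree behave as if
-- the query had been written with an extra clause on "dept", whose
-- "array keys" are discovered on the fly via ad-hoc index probes
-- rather than supplied by the user:
select * from employees
where dept in (/* every distinct dept, found by probing the index */)
  and salary = 50000;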
Thoughts?

[1] http://vldb.org/conf/1995/P710.PDF
[2] https://www.postgresql.org/message-id/2587523.1647982...@sss.pgh.pa.us
[3] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d
--
Peter Geoghegan

v1-0001-Enhance-nbtree-ScalarArrayOp-execution.patch