I returned to the 1995 paper "Efficient Search of Multidimensional B-Trees" [1] as part of the process of reviewing v39 of the skip scan patch, which was posted back in May. It's a great paper, and anybody involved in the skip scan effort should read it thoroughly (if they haven't already). It's easy to see why people get excited about skip scan [2]. But there is a bigger picture here.
I don't necessarily expect to come away from this discussion with a much better high level architecture for the patch, or any kind of deeper insight, or even a frame of reference for further discussion. I just think that we ought to *try* to impose some order on this stuff. Like many difficult patches, the skip scan patch is not so much troubled by problems with the implementation as it is troubled by *ambiguity* about the design. In particular, there is ambiguity about how skip scan meshes with existing designs, as well as with future designs -- especially designs for other MDAM techniques. I've started this thread to have a big picture conversation about how to think about these things.

Many other MDAM techniques also seem highly appealing. Much of the MDAM stuff is for data warehousing use-cases, while skip scan/loose index scan is seen as more of an OLTP thing. But they are still clearly related.

I'd also like to talk about another patch, which ISTM had that same quality -- it was also held back by high level design uncertainty. Back in 2018, Tom abandoned a patch that transformed a star-schema style query with left outer joins on dimension tables with OR conditions into an equivalent query that UNIONs together 2 distinct queries [3][4]. (I'll show a rough sketch of that transformation a bit further down.)

Believe it or not, I am now reminded of that patch by the example of "IN() Lists", from page 5 of the paper. We see this example SQL query:

SELECT date, item_class, store, SUM(total_sales)
FROM sales
WHERE date BETWEEN '06/01/95' AND '06/30/95'
  AND item_class IN (20, 35, 50)
  AND store IN (200, 250)
GROUP BY dept, date, item_class, store;

Granted, this SQL might not seem directly relevant to Tom's patch at first -- there is no join for the optimizer to even try to eliminate, which was the whole basis of Jim Nasby's original complaint (the complaint that spurred Tom to write the patch in the first place). But hear me out: there is still a fact table (the sales table) with some dimensions (the 'D' from 'MDAM') shown in the predicate. Moreover, the table (and this SQL query) drives discussion of an optimization involving transforming a predicate with many ORs (which is explicitly said to be logically/semantically equivalent to the IN() lists from the query). They transform the query into a bunch of disjuncts that can easily be executed independently, and combined at the end (see also "General OR Optimization" on page 6 of the paper).

Also... I'm not entirely sure that the intended underlying "physical plan" is truly free of join-like scans. If you squint just right, you might see something that you could think of as a "physical join" (at least very informally). The whole point of this particular "IN() Lists" example is that we get to the following accesses, for each distinct "dept" and "date" in the table:

dept=1, date='06/04/95', item_class=20, store=200
dept=1, date='06/04/95', item_class=20, store=250
dept=1, date='06/04/95', item_class=35, store=200
dept=1, date='06/04/95', item_class=35, store=250
dept=1, date='06/04/95', item_class=50, store=200
dept=1, date='06/04/95', item_class=50, store=250

Imagine additional lines like these for every distinct combination of dept and date (only for those dates that actually had sales, which they enumerate up front), crossed with stores 200 and 250 and item_classes 20, 35, and 50 -- 2400 such accesses in total after transformation. Even 2400 index probes will be much faster than a full table scan, given that this is a large fact table.
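To make the paper's transformation a little more concrete, here is roughly what the decomposed access pattern might look like if it were spelled out as SQL. This is purely illustrative -- MDAM performs these probes inside the index scan itself, rather than by rewriting the query, the dept/date values are just the made-up ones from above, and the aggregation step is omitted:

SELECT date, item_class, store, total_sales FROM sales
WHERE dept = 1 AND date = '06/04/95' AND item_class = 20 AND store = 200
UNION ALL
SELECT date, item_class, store, total_sales FROM sales
WHERE dept = 1 AND date = '06/04/95' AND item_class = 20 AND store = 250
UNION ALL
... <one single-combination branch for each of the 2400 accesses> ...

Each branch equality-constrains all four key columns, so each one amounts to a single cheap index descent that returns a small number of rows.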
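For comparison, the transformation from Tom's abandoned 2018 patch had roughly the following shape (another purely illustrative sketch -- the table and column names here are invented, not taken from the original thread):

SELECT f.*
FROM fact f
LEFT JOIN dim1 ON f.dim1_id = dim1.id
LEFT JOIN dim2 ON f.dim2_id = dim2.id
WHERE dim1.x = 42 OR dim2.y = 99;

-- becomes something like:

SELECT f.* FROM fact f JOIN dim1 ON f.dim1_id = dim1.id WHERE dim1.x = 42
UNION
SELECT f.* FROM fact f JOIN dim2 ON f.dim2_id = dim2.id WHERE dim2.y = 99;

The UNION (as opposed to UNION ALL) is what makes duplicate elimination unavoidable there, which I'll come back to below.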
The "sales" table is stored as a clustered index whose keys are the columns "(dept, date, item_class, store)", per the note at the top of page 4. The whole point is to avoid having any secondary indexes on this fact table, while still avoiding a full table scan. We can just probe the primary key 2400 times instead, following this transformation. No need for secondary indexes. The plan can be thought of as a DAG, at least informally.

This is also somewhat similar to what Tom was thinking about back in 2018. Tom had to deduplicate rows during execution (IIRC using a UNION-style ad-hoc approach that sorted on TIDs), whereas I think that they can get away with skipping that extra step. Page 7 says "MDAM removes duplicates before reading the data, so it does not have to do any post read operations to accomplish duplicate elimination (a common problem with OR optimization)".

My general concern is that the skip scan patch may currently be structured in a way that paints us into a corner, MDAM-wise. Note that the MDAM paper treats skipping a prefix of columns as a case where the prefix is handled by pretending that there is a clause that looks like this: "WHERE date BETWEEN -inf AND +inf" -- which is not so different from the original sales SQL query example that I have highlighted. We don't tend to think of queries like this (like my sales query) as in any way related to skip scan, because we don't imagine that there is any skipping going on. But maybe we should recognize the similarities.

BTW, these imaginary -inf/+inf values seem to me to be just like the sentinel values already used inside nbtree for pivot tuples -- truncated suffix key columns are treated as having -inf values, and you can think of a rightmost page as having a +inf high key, per the L&Y paper. Wearing my B-Tree hat, I don't see much difference between imaginary -inf/+inf values and the values from the BETWEEN "date" range in the example SQL query. I have in the past wondered if _bt_get_endpoint() should have been implemented that way -- we could go through _bt_search() instead, and get rid of that code. All we need is insertion scan keys that can explicitly contain the same -inf/+inf sentinel values. Maybe this also allows us to get rid of BTScanInsertData.nextkey semantics (not sure offhand).

Another, more concrete concern about the patch series comes from the backwards scan stuff, added by a later patch in the series, "v39-0004-Extend-amskip-implementation-for-Btree.patch". It strikes me as a bad thing that we cannot just do leaf-page-at-a-time processing here, without usually needing to hold a pin on the leaf page. After all, ordinary backwards scans manage to avoid that today, albeit by using trickery inside _bt_walk_left(). MDAM-style "Maintenance of Index Order" (as described on page 8) seems like a good goal for us here.

I don't like the idea of doing ad-hoc duplicate TID elimination inside nbtree, across calls made from the executor (whether it's during backwards skip scans, or at any other time). Not because it seems to go against the approach taken by the MDAM paper (though it does); just because it seems kludgy. (I think that Tom felt the same way about the TID deduplication stuff in his own patch back in 2018, too.)

Open question: What does all of this MDAM business mean for ScalarArrayOpExpr, if anything?

I freely admit that I could easily be worrying over nothing here. But if I am, I'd really like to know *why* that's the case.
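To make that open question slightly more concrete (these queries are just illustrations built on the paper's sales example; I'm not claiming anything about how nbtree handles them today): the paper's IN() lists are more or less exactly what we represent as ScalarArrayOpExpr quals, so a maximally general MDAM-style design would presumably want to drive all of the following through one and the same "enumerate combinations of key column values, then probe" machinery:

-- IN() lists on non-leading index columns (the paper's example, minus
-- the aggregation):
SELECT * FROM sales
WHERE date BETWEEN '06/01/95' AND '06/30/95'
  AND item_class IN (20, 35, 50)
  AND store IN (200, 250);

-- A classic "skip scan" query: no predicate on the leading columns at
-- all, so each of them is in effect "BETWEEN -inf AND +inf":
SELECT * FROM sales WHERE store = 200;

-- A loose index scan style DISTINCT query, of the kind discussed in [2]:
SELECT DISTINCT dept FROM sales;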
[1] http://vldb.org/conf/1995/P710.PDF
[2] https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
[3] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
[4] https://www.postgresql.org/message-id/flat/14593.1517581614%40sss.pgh.pa.us#caf373b36332f25acb7673bff331c95e

--
Peter Geoghegan