I spent some time looking into the performance complaint at [1], which for the sake of self-containedness is
CREATE TABLE t(a int, b int); INSERT INTO t(a, b) SELECT (random() * 123456)::int AS a, (random() * 123456)::int AS b FROM generate_series(1, 12345678); CREATE INDEX my_idx ON t USING BTREE (a, b); VACUUM ANALYZE t; explain (analyze, buffers) select * from t where row(a, b) > row(123450, 123444) and a = 0 order by a, b; This produces something like Index Only Scan using my_idx on t (cost=0.43..8.46 rows=1 width=8) (actual time=475.713..475.713 rows=0 loops=1) Index Cond: ((ROW(a, b) > ROW(123450, 123444)) AND (a = 0)) Heap Fetches: 0 Buffers: shared hit=1 read=33731 Planning: Buffers: shared read=4 Planning Time: 0.247 ms Execution Time: 475.744 ms showing that we are reading practically the whole index, which is pretty sad considering the index conditions are visibly mutually contradictory. What's going on? I find that: 1. _bt_preprocess_keys, which is responsible for detecting mutually-contradictory index quals, fails to do so because it really just punts on row-comparison quals: it shoves them directly from input to output scankey array without any comparisons to other keys. (In particular, that causes the row-comparison qual to appear before the a = 0 one in the output scankey array.) 2. The initial-positioning logic in _bt_first chooses "a = 0" as determining where to start the scan, because it always prefers equality over inequality keys. (This seems reasonable.) 3. We really should stop the scan once we're past the last a = 0 index entry, which'd at least limit the damage. However, at both a = 0 and later entries, the row-comparison qual fails causing _bt_check_compare to return immediately, without examining the a = 0 key which is marked as SK_BT_REQFWD, and thus it does not clear "continuescan". Only when we finally reach an index entry for which the row-comparison qual succeeds do we notice that a = 0 is failing so we could stop the scan. So this seems pretty horrid. It would be nice if _bt_preprocess_keys were smart enough to notice the contradictory nature of these quals, but I grant that (a) that routine is dauntingly complex already and (b) this doesn't seem like a common enough kind of query to be worth moving heaven and earth to optimize. However, I do think we should do something about the unstated assumption that _bt_preprocess_keys can emit the quals (for a given column) in any random order it feels like. This is evidently not so, and it's probably capable of pessimizing other examples besides this one. Unless we want to slow down _bt_check_compare by making it continue to examine quals after the first failure, we need to insist that required quals appear before non-required quals, thus ensuring that a comparison failure will clear continuescan if possible. Even that looks a little nontrivial, because it seems like nbtree may be making some assumptions about the order in which array keys appear. I see the bit about * ... Some reordering of the keys * within each attribute may be done as a byproduct of the processing here. * That process must leave array scan keys (within an attribute) in the same * order as corresponding entries from the scan's BTArrayKeyInfo array info. which I could cope with, but then there's this down around line 2967: * Note: We do things this way around so that our arrays are * always in the same order as their corresponding scan keys, * even with incomplete opfamilies. _bt_advance_array_keys * depends on this. However, despite the rather over-the-top verbosity of commenting in _bt_advance_array_keys, it's far from clear why or how it depends on that. So I feel a little stuck about what needs to be done here. regards, tom lane [1] https://www.postgresql.org/message-id/CAAdwFAxBjyrYUkH7u%2BEceTaztd1QxBtBY1Teux8K%3DvcGKe%3D%3D-A%40mail.gmail.com