On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <p...@bowt.ie> wrote: > All of that said, maybe it would be clearer if page deletion was not > the special case that has to opt in to minusinfkey semantics. Perhaps > it would make more sense for everyone else to opt out of minusinfkey > semantics, and to get the !minusinfkey optimization as a result of > that. I only did it the other way around because that meant that only > nbtpage.c had to acknowledge the special case.
This seems like a good idea -- we should reframe the !minusinfkey optimization, without actually changing the behavior. Flip it around. The minusinfkey field within the insertion scankey struct would be called something like "descendrighttrunc" instead. Same idea, but with the definition inverted. Most _bt_search() callers (all of those outside of nbtpage.c and amcheck) would be required to opt in to that optimization to get it. Under this arrangement, nbtpage.c/page deletion would not ask for the "descendrighttrunc" optimization, and would therefore continue to do what it has always done: find the first leaf page that its insertion scankey values could be on (we don't lie about searching for negative infinity, or having a negative infinity sentinel value in scan key). The only difference for page deletion between v3 indexes and v4 indexes is that with v4 indexes we'll relocate the same leaf page reliably, since every separator key value is guaranteed to be unique on its level (including the leaf level/leaf high keys). This is just a detail, though, and not one that's even worth pointing out; we're not *relying* on that being true on v4 indexes anyway (we check that the block number is a match too, which is strictly necessary for v3 indexes and seems like a good idea for v4 indexes). This is also good because it makes it clear that the unique index code within _bt_doinsert() (that temporarily sets scantid to NULL) benefits from the descendrighttrunc/!minusinfkey optimization -- it should be "honest" and ask for it explicitly. We can make _bt_doinsert() opt in to the optimization for unique indexes, but not for other indexes, where scantid is set from the start. The descendrighttrunc/!minusinfkey optimization cannot help when scantid is set from the start, because we'll always have an attribute value in insertion scankey that breaks the tie for us instead. We'll always move right of a heap-TID-truncated separator key whose untruncated attributes are all equal to a prefix of our insertion scankey values. (This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for unique indexes matters more than you might think -- we do really badly with things like Zipfian distributions currently, and reducing the contention goes some way towards helping with that. Postgres pro noticed this a couple of years back, and analyzed it in detail at that time. It's really nice that we very rarely have to move right within code like _bt_check_unique() and _bt_findsplitloc() with the patch.) Does that make sense to you? Can you live with the name "descendrighttrunc", or do you have a better one? Thanks -- Peter Geoghegan