Andres and I discussed bottlenecks in the nbtree code during the recent PgDay SF community event. Andres observed that the call to BTreeTupleGetNAtts() in _bt_compare() can become somewhat of a bottleneck. I came up with the very simple attached POC-quality patches, which Andres tested and profiled with his original complaint in mind yesterday. He reported increased throughput on a memory bound simple pgbench SELECT-only workload.
He reported that the problem went away with the patches applied. The following pgbench SELECT-only result was sent to me privately: before: single: tps = 30300.202363 (excluding connections establishing) all cores: tps = 1026853.447047 (excluding connections establishing) after: single: tps = 31120.452919 (excluding connections establishing) all cores: tps = 1032376.361006 (excluding connections establishing) (Actually, he tested something slightly different -- he inlined _bt_compare() in his own way instead of using my 0002-*, and didn't use the 0003-* optimization at all.) Apparently this was a large multi-socket machine. Those are hard to come by. The main idea here is to make _bt_compare() delay calling BTreeTupleGetNAtts() until the point after the first attribute turns out to be equal to the _bt_compare() caller's insertion scankey. Many or most calls to _bt_compare() won't even need to call BTreeTupleGetNAtts(). This relies on the assumption that any tuple must have at least one untruncated suffix column in the _bt_compare() loop. It doesn't matter whether it's a pivot or non-pivot tuple -- the representation of the first column will be exactly the same. The negative infinity item on an internal page always has zero attributes, which might seem like a snag. However, we already treat that as a special case within _bt_compare(), for historical reasons (pg_upgrade'd indexes won't have the number of attributes explicitly set to zero in some cases). Another separate though related idea in 0003-* is to remove the negative infinity check. It goes from _bt_compare() to _bt_binsrch(), since it isn't something that we need to consider more than once per page -- and only on internal pages. That way, _bt_compare() doesn't have to look at the page special area to check if it's a leaf page or an internal page at all. I haven't really profiled this just yet. This is one of those patches where 95%+ of the work is profiling and benchmarking. Andres and I both agree that there is a lot more work to be done in this area, but that will be a major undertaking. I am quite keen on the idea of repurposing the redundant-to-nbtree ItemId.lp_len field to store an abbreviated key. Making that work well is a considerable undertaking, since you need to use prefix compression to get a high entropy abbreviated key. It would probably take me the best part of a whole release cycle to write such a patch. The attached patches get us a relatively easy win in the short term, though. Thoughts? -- Peter Geoghegan
v1-0001-Avoid-calling-BTreeTupleGetNAtts-in-_bt_compare.patch
Description: Binary data
v1-0003-Remove-negative-infinity-check-from-_bt_compare.patch
Description: Binary data
v1-0002-Inline-_bt_compare.patch
Description: Binary data