Andres and I discussed bottlenecks in the nbtree code during the
recent PgDay SF community event. Andres observed that the call to
BTreeTupleGetNAtts() in _bt_compare() can become somewhat of a
bottleneck. I came up with the very simple attached POC-quality
patches, which Andres tested and profiled with his original complaint
in mind yesterday. He reported increased throughput on a memory
bound simple pgbench SELECT-only workload.

He reported that the problem went away with the patches applied. The
following pgbench SELECT-only result was sent to me privately:

before:
single:         tps = 30300.202363 (excluding connections establishing)
all cores:      tps = 1026853.447047 (excluding connections establishing)

after:
single:         tps = 31120.452919 (excluding connections establishing)
all cores:      tps = 1032376.361006 (excluding connections establishing)

(Actually, he tested something slightly different -- he inlined
_bt_compare() in his own way instead of using my 0002-*, and didn't
use the 0003-* optimization at all.)

Apparently this was a large multi-socket machine. Those are hard to
come by.

The main idea here is to make _bt_compare() delay
calling BTreeTupleGetNAtts() until the point after the first attribute
turns out to be equal to the _bt_compare() caller's insertion scankey.
Many or most calls to _bt_compare() won't even need to call
BTreeTupleGetNAtts().

This relies on the assumption that any tuple must have at least one
untruncated suffix column in the _bt_compare() loop. It doesn't matter
whether it's a pivot or non-pivot tuple -- the representation of the
first column will be exactly the same.

The negative infinity item on an internal page always has zero
attributes, which might seem like a snag. However, we already treat
that as a special case within _bt_compare(), for historical reasons
(pg_upgrade'd indexes won't have the number of attributes explicitly
set to zero in some cases).

Another separate though related idea in 0003-* is to remove the
negative infinity check. It goes from _bt_compare() to _bt_binsrch(),
since it isn't something that we need to consider more than once per
page -- and only on internal pages. That way, _bt_compare() doesn't
have to look at the page special area to check if it's a leaf page or
an internal page at all. I haven't really profiled this just yet. This is
one of those patches where 95%+ of the work is profiling and
benchmarking.

Andres and I both agree that there is a lot more work to be done in
this area, but that will be a major undertaking. I am quite keen on
the idea of repurposing the redundant-to-nbtree ItemId.lp_len field to
store an abbreviated key. Making that work well is a considerable
undertaking, since you need to use prefix compression to get a high
entropy abbreviated key. It would probably take me the best part of a
whole release cycle to write such a patch. The attached patches get
us a relatively easy win in the short term, though.

Thoughts?
--
Peter Geoghegan

Attachment: v1-0001-Avoid-calling-BTreeTupleGetNAtts-in-_bt_compare.patch
Description: Binary data

Attachment: v1-0003-Remove-negative-infinity-check-from-_bt_compare.patch
Description: Binary data

Attachment: v1-0002-Inline-_bt_compare.patch
Description: Binary data

Reply via email to