Attached patch teaches nbtree backwards scans to avoid needlessly
relocking a previously read page/buffer at the point where we need to
consider reading the next page (the page to the left).

Currently, we do this regardless of whether or not we already decided
to end the scan, back when we read the page within _bt_readpage. We'll
relock the page we've just read (and just unlocked), just to follow
its left link, while making sure to deal with concurrent page splits
correctly. But why bother relocking the page, or with thinking about
concurrent splits, if we can already plainly see that we cannot
possibly need to find matching tuples on the left sibling page?

The patch just adds a simple precheck, which works in the obvious way.
Seems like this was a missed opportunity for commit 2ed5b87f96.

On HEAD, the following query requires 24 buffer hits (I'm getting a
standard/forward index scan for this):

select
  abalance
from
  pgbench_accounts
where
  aid in (1, 500, 1000, 1500, 2000, 3000) order by aid asc;

However, we don't see that with the backwards scan variant:

select
  abalance
from
  pgbench_accounts
where
  aid in (1, 500, 1000, 1500, 2000, 3000) order by aid desc;

We actually see 30 buffer hits for this on HEAD. Whereas with the
patch, both variants get only 24 buffer hits -- there's parity, at
least in cases like these two.

Note that we only "achieve parity" here because we happen to be using
a SAOP, requiring multiple primitive index scans, each of which ends
with its own superfluous lock acquisition. Backwards scans remain at a
disadvantage with regard to buffer locks acquired in other cases --
cases that happen to involve scanning several sibling leaf pages in
sequence (no change there).

It's probably possible to fix those more complicated cases too. But
that would require a significantly more complicated approach. I
haven't touched existing comments in _bt_readnextpage that contemplate
this possibility.

-- 
Peter Geoghegan

Attachment: v1-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-lock.patch
Description: Binary data

Reply via email to