Re: Index range search optimization

Alexander Korotkov Fri, 22 Sep 2023 07:25:23 -0700

Hi Peter,
Hi Pavel,

The v4 of the patch is attached.


On Thu, Sep 21, 2023 at 11:48 PM Peter Geoghegan <p...@bowt.ie> wrote:
>
> On Thu, Sep 21, 2023 at 5:11 AM Pavel Borisov <pashkin.e...@gmail.com> wrote:
> > I looked at the patch code and I agree with this optimization.
> > Implementation also looks good to me except change :
> > + if (key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD) &&
> > + !(key->sk_flags & SK_ROW_HEADER))
> > + requiredDir = true;
> > ...
> > - if ((key->sk_flags & SK_BT_REQFWD) &&
> > - ScanDirectionIsForward(dir))
> > - *continuescan = false;
> > - else if ((key->sk_flags & SK_BT_REQBKWD) &&
> > - ScanDirectionIsBackward(dir))
> > + if (requiredDir)
> >   *continuescan = false;
> >
> > looks like changing behavior in the case when key->sk_flags &
> > SK_BT_REQFWD && (! ScanDirectionIsForward(dir)) &&
> > (!requiredDirMatched)
> > Originally it doesn't set *continuescan = false; and with the patch it will 
> > set.
>
> I agree that this is a problem. Inequality strategy scan keys are used
> when the initial positioning strategy used by _bt_first (for its
> _bt_search call) is based on an operator other than the "=" operator
> for the opclass. These scan keys are required in one direction only
> (Konstantin's original patch just focussed on these cases, actually).
> Obviously, that difference matters. I don't think that this patch
> should do anything that even looks like it might be revising the
> formal definition of "required in the current scan direction".

Sorry, that was messed up from various attempts to write the patch.
Actually, I end up with two boolean variables indicating whether the
current key is required for the same direction or opposite direction
scan.  I believe that the key required for the opposite direction scan
should be already satisfied by _bt_first() except for NULLs case.
I've implemented a skip of calling the key function for this case
(with assert that result is the same).

> Why is SK_ROW_HEADER treated as a special case by the patch? Could it
> be related to the issues with required-ness and scan direction? Note
> that we never use BTEqualStrategyNumber for SK_ROW_HEADER scan key row
> comparisons, so they're only ever required for one scan direction.
> (Equality-type row constructor syntax can of course be used without
> preventing the system from using an index scan, but the nbtree code
> will not see that case as a row comparison in the first place. This is
> due to preprocessing by the planner -- nbtree just sees conventional
> scan keys with multiple simple equality scan keys with = row
> comparisons.)

The thing is that NULLs could appear in the middle of matching values.

# WITH t (a, b) AS (VALUES ('a', 'b'), ('a', NULL), ('b', 'a'))
SELECT a, b, (a, b) > ('a', 'a') FROM t ORDER BY (a, b);
 a |  b   | ?column?
---+------+----------
 a | b    | t
 a | NULL | NULL
 b | a    | t
(3 rows)

So we can't just skip the row comparison operator, because we can meet
NULL at any place.

> > This may be relevant for the first page when requiredDirMatched is
> > intentionally skipped to be set and for call
> > _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false);
>
> Also, requiredDirMatched isn't initialized by _bt_readpage() when
> "so->firstPage". Shouldn't it be initialized to false?
>
> Also, don't we need to take more care with a fully empty page? The "if
> (!so->firstPage) ... " block should be gated using a condition such as
> "if (!so->firstPage && minoff < maxoff)". (Adding a "minoff <= maxoff"
> test would also work, but then the optimization will get applied on
> pages with only one non-pivot tuple. That would be harmless, but a
> waste of cycles.)

This makes sense.  I've added (minoff < maxoff) to the condition.

> > Also naming of requiredDirMatched and requiredDir seems semantically
> > hard to understand the meaning without looking at the patch commit
> > message. But I don't have better proposals yet, so maybe it's
> > acceptable.
>
> I agree. How about "requiredMatchedByPrecheck" instead of
> "requiredDirMatched", and "required" instead of "requiredDir"?
>
> It would be nice if this patch worked in a way that could be verified
> by an assertion. Under this scheme, the optimization would only really
> be used in release builds (builds without assertions enabled, really).
> We'd only verify that the optimized case agreed with the slow path in
> assert-enabled builds. It might also make sense to always "apply the
> optimization" on assert-enabled builds, even for the first page seen
> by _bt_readpage by any _bt_first-wise scan. Maybe this sort of
> approach is impractical here for some reason, but I don't see why it
> should be.

Yes, this makes sense.  I've added an assert check that results are
the same as with requiredMatchedByPrecheck == false.

> Obviously, the optimization should lower the amount of work in some
> calls to _bt_checkkeys, without ever changing the answer _bt_checkkeys
> gives. Ideally, it should be done in a way that makes that very
> obvious. There are some very subtle interactions between _bt_checkkeys
> and other, distant code -- which makes me feel paranoid. Notably, with
> required equality strategy scan keys, we're crucially dependent on
> _bt_first using an equality strategy for its initial positioning call
> to _bt_search. This is described by comments in both _bt_checkkeys and
> in _bt_first.
>
> Note, in particular, that it is essential that the initial offnum
> passed to _bt_readpage doesn't allow a call to _bt_checkkeys to take
> place that could cause it to become confused by a required equality
> strategy scan key, leading to _bt_checkkeys terminating the whole scan
> "early" -- causing wrong answers. For a query "WHERE foo = 5" (and a
> forward scan), we had better not pass _bt_readpage an offset number
> for a tuple with "foo" value 4. If that is ever allowed then
> _bt_checkkeys will terminate the scan immediately, leading to wrong
> answers. All because _bt_checkkeys can't tell if 4 comes before 5 or
> comes after 5 -- it only has an "=" operator to work with, so it can't
> actually make this distinction, so it likes to assume that anything !=
> 5 must come after 5 (or before 5 during a backwards scan).
>
> I added a very similar _bt_compare()-based assertion in
> _bt_check_unique(), which went on to catch a very subtle bug in the
> Postgres 12 nbtree work -- the bug fixed by commit 74eb2176bf. So I
> have put this particular idea about asserting agreement between a fast
> path and a slow comparison path into practice already.

Good, thank you for the detailed clarification.

------
Regards,
Alexander Korotkov

From 5afb4c1181c8d017ad854e54a4a4c75348e03334 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Sep 2023 13:18:09 +0300
Subject: [PATCH] Skip checking of scan keys required for directional scan in
 B-tree

Currently, B-tree code matches every scan key to every item on the page.
Imagine the ordered B-tree scan for the query like this.

SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;

The (col > 'a') scan key will be always matched once we find the location to
start the scan.  The (col < 'b') scan key will match every item on the page
as long as it matches the last item on the page.

This patch implements prechecking of the scan keys required for directional
scan on beginning of page scan.  If precheck is successful we can skip this
scan keys check for the items on the page.  That could lead to significant
acceleration especially if the comparison operator is expensive.

Idea from patch by Konstantin Knizhnik.

Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru
Reviewed-by: Peter Geoghegan
---
 src/backend/access/nbtree/nbtsearch.c | 61 +++++++++++++++++++++++++--
 src/backend/access/nbtree/nbtutils.c  | 59 ++++++++++++++++++++------
 src/include/access/nbtree.h           |  6 ++-
 3 files changed, 109 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 17ad89749d5..b8435729d9f 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1429,6 +1429,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned, if any */
 	Assert(!BTScanPosIsValid(so->currPos));
 	so->currPos.buf = buf;
+	so->firstPage = true;
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1539,6 +1540,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	bool		requiredMatchedByPrecheck;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1592,6 +1594,38 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	 */
 	Assert(BTScanPosIsPinned(so->currPos));
 
+	/*
+	 * Prechecking the page with scan keys required for direction scan.  We
+	 * check these keys with the last item on the page (according to our scan
+	 * direction).  If these keys are matched, we can skip checking them with
+	 * every item on the page.  Scan keys for our scan direction would
+	 * necessarily match the previous items.  Scan keys required for opposite
+	 * direction scan are already matched by the _bt_first() call.
+	 *
+	 * With the forward scan, we do this check for the last item on the page
+	 * instead of the high key.  It's relatively likely that the most
+	 * significant column in the high key will be different from the
+	 * corresponding value from the last item on the page.  So checking with
+	 * the last item on the page would give a more precise answer.
+	 *
+	 * We skip this for the first page in the scan to evade the possible
+	 * slowdown of the point queries.
+	 */
+	if (!so->firstPage && minoff < maxoff)
+	{
+		ItemId		iid;
+		IndexTuple	itup;
+
+		iid = PageGetItemId(page, ScanDirectionIsForward(dir) ? maxoff : minoff);
+		itup = (IndexTuple) PageGetItem(page, iid);
+		(void) _bt_checkkeys(scan, itup, indnatts, dir,
+							 &requiredMatchedByPrecheck, false);
+	}
+	else
+	{
+		so->firstPage = false;
+	}
+
 	if (ScanDirectionIsForward(dir))
 	{
 		/* load items[] in ascending order */
@@ -1603,6 +1637,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	itup;
+			bool		passes_quals;
 
 			/*
 			 * If the scan specifies not to return killed tuples, then we
@@ -1616,7 +1651,18 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			itup = (IndexTuple) PageGetItem(page, iid);
 
-			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+										 &continuescan, requiredMatchedByPrecheck);
+
+			/*
+			 * If the result of prechecking required keys was true, then in
+			 * assert-enabled builds we also recheck that _bt_checkkeys()
+			 * result is is the same.
+			 */
+			Assert(!requiredMatchedByPrecheck ||
+				   passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
+												 &continuescan, false));
+			if (passes_quals)
 			{
 				/* tuple passes all scan key conditions */
 				if (!BTreeTupleIsPosting(itup))
@@ -1673,7 +1719,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			int			truncatt;
 
 			truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
-			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan, false);
 		}
 
 		if (!continuescan)
@@ -1725,7 +1771,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			itup = (IndexTuple) PageGetItem(page, iid);
 
 			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
-										 &continuescan);
+										 &continuescan, requiredMatchedByPrecheck);
+
+			/*
+			 * If the result of prechecking required keys was true, then in
+			 * assert-enabled builds we also recheck that _bt_checkkeys()
+			 * result is is the same.
+			 */
+			Assert(!requiredMatchedByPrecheck ||
+				   passes_quals == _bt_checkkeys(scan, itup, indnatts, dir,
+												 &continuescan, false));
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7da499c4dd5..a2eb578b0d2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1357,10 +1357,13 @@ _bt_mark_scankey_required(ScanKey skey)
  * tupnatts: number of attributes in tupnatts (high key may be truncated)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
+ * requiredDirMatched: indicates that scan keys required for direction scan
+ *					   are already matched
  */
 bool
 _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
-			  ScanDirection dir, bool *continuescan)
+			  ScanDirection dir, bool *continuescan,
+			  bool requiredMatchedByPrecheck)
 {
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
@@ -1381,6 +1384,29 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 		Datum		datum;
 		bool		isNull;
 		Datum		test;
+		bool		requiredSameDir = false,
+					requiredOppositeDir = false;
+
+		/*
+		 * Check if the key is required for ordered scan in the same or
+		 * opposite direction.  Save as flag variables for future usage.
+		 */
+		if (((key->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
+			((key->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
+			requiredSameDir = true;
+		else if (((key->sk_flags & SK_BT_REQFWD) && ScanDirectionIsBackward(dir)) ||
+				 ((key->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsForward(dir)))
+			requiredOppositeDir = true;
+
+		/*
+		 * Is the key required for scanning for either forward or backward
+		 * direction?  If so and called told us that these types of keys are
+		 * known to be matched, skip the check.  Except for the row keys,
+		 * where NULLs could be found in the middle of matching values.
+		 */
+		if ((requiredSameDir || requiredOppositeDir) &&
+			!(key->sk_flags & SK_ROW_HEADER) && requiredMatchedByPrecheck)
+			continue;
 
 		if (key->sk_attno > tupnatts)
 		{
@@ -1429,11 +1455,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * scan direction, then we can conclude no further tuples will
 			 * pass, either.
 			 */
-			if ((key->sk_flags & SK_BT_REQFWD) &&
-				ScanDirectionIsForward(dir))
-				*continuescan = false;
-			else if ((key->sk_flags & SK_BT_REQBKWD) &&
-					 ScanDirectionIsBackward(dir))
+			if (requiredSameDir)
 				*continuescan = false;
 
 			/*
@@ -1483,8 +1505,23 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			return false;
 		}
 
-		test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
-								 datum, key->sk_argument);
+		/*
+		 * Apply the key checking function.  When the key is required for
+		 * opposite direction scan, it must be already satisfied by
+		 * _bt_first() except for the NULLs checking, which have already done
+		 * above.
+		 */
+		if (!requiredOppositeDir)
+		{
+			test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
+									 datum, key->sk_argument);
+		}
+		else
+		{
+			test = true;
+			Assert(test == FunctionCall2Coll(&key->sk_func, key->sk_collation,
+											 datum, key->sk_argument));
+		}
 
 		if (!DatumGetBool(test))
 		{
@@ -1498,11 +1535,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * initial positioning in _bt_first() when they are available. See
 			 * comments in _bt_first().
 			 */
-			if ((key->sk_flags & SK_BT_REQFWD) &&
-				ScanDirectionIsForward(dir))
-				*continuescan = false;
-			else if ((key->sk_flags & SK_BT_REQBKWD) &&
-					 ScanDirectionIsBackward(dir))
+			if (requiredSameDir)
 				*continuescan = false;
 
 			/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f5c66964ca0..66e6a7775bf 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1054,6 +1054,9 @@ typedef struct BTScanOpaqueData
 	int		   *killedItems;	/* currPos.items indexes of killed items */
 	int			numKilled;		/* number of currently stored items */
 
+	/* flag inficating the first page in the scan */
+	bool		firstPage;
+
 	/*
 	 * If we are doing an index-only scan, these are the tuple storage
 	 * workspaces for the currPos and markPos respectively.  Each is of size
@@ -1253,7 +1256,8 @@ extern void _bt_mark_array_keys(IndexScanDesc scan);
 extern void _bt_restore_array_keys(IndexScanDesc scan);
 extern void _bt_preprocess_keys(IndexScanDesc scan);
 extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
-						  int tupnatts, ScanDirection dir, bool *continuescan);
+						  int tupnatts, ScanDirection dir, bool *continuescan,
+						  bool requiredMatched);
 extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
-- 
2.37.1 (Apple Git-137.1)

Re: Index range search optimization

Reply via email to