Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

Peter Geoghegan Fri, 05 Sep 2014 19:46:20 -0700

On Wed, Sep 3, 2014 at 2:44 PM, Peter Geoghegan <p...@heroku.com> wrote:
> I guess it should still be a configure option, then. Or maybe there
> should just be a USE_ABBREV_KEYS macro within pg_config_manual.h.

Attached additional patches are intended to be applied on top off most
of the patches posted on September 2nd [1]. Note that you should not
apply patch 0001-* from that set to master, since it has already been
committed to master [2]. However, while rebasing I revised
patch/commit 0005-* to abbreviation used on all platforms, including
32-bit platforms (the prior 0005-* patch just re-enabled the
optimization on Darwin/Apple), so you should discard the earlier
0005-* patch. In a later commit I also properly formalize the idea
that we always do opportunistic "memcmp() == 0" checks, no matter what
context a sortsupport-accelerated text comparison occurs in. That
seems like a good idea, but it's broken out in a separate commit in
case you are not in agreement.

While I gave serious consideration to your idea of having a dedicated
abbreviation comparator, and not duplicating sortsupport state when
abbreviated keys are used (going so far as to almost fully implement
the idea), I ultimately decided that my vote says we don't do that. It
seemed to me that there were negligible benefits for increased
complexity. In particular, I didn't want to burden tuplesort with
having to worry about whether or not abbreviation was aborted during
tuple copying, or was not used by the opclass in the first place -
implementing your scheme makes that distinction relevant. It's very
convenient to have comparetup_heap() "compare the leading sort key"
(that specifically looks at SortTuple.datum1 pairs) indifferently,
using the same comparator for "abbreviated" and "not abbreviated"
cases indifferently. comparetup_heap() does not seem like a great
place to burden with caring about each combination any more than
strictly necessary.

I like that I don't have to care about every combination, and can
treat abbreviation abortion as the special case with the extra step,
in line with how I think of the optimization conceptually. Does that
make sense? Otherwise, there'd have to be a ApplySortComparator()
*and* "ApplySortComparatorAbbreviated()" call with SortTuple.datum1
pairs passed, as appropriate for each opclass (and abortion state), as
well as a heap_getattr() tie-breaker call for the latter case alone
(when we got an inconclusive answer, OR when abbreviation was
aborted). Finally, just as things are now, there'd have to be a loop
where the second or subsequent attributes are dealt with by
ApplySortComparator()'ing. So AFAICT under your scheme there are 4
ApplySortComparator* call sites required, rather than 3 as under mine.

Along similar lines, I thought about starting from nkey = 0 within
comparetup_heap() when abortion occurs (so that there'd only be 2
ApplySortComparator() call sites - no increase from master) , but that
turns out to be messy, plus I like those special tie-breaker
assertions.

I will be away for much of next week, and will have limited access to
e-mail. I will be around tomorrow, though. I hope that what I've
posted is suitable to commit without further input from me.

[1]
http://www.postgresql.org/message-id/cam3swztetqckc24lhwkdlasjf-b-ccnn4q0oyjhgbx+ncpn...@mail.gmail.com
[2]
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d8d4965dc29263462932be03d4206aa694e2cd7e
--
Peter Geoghegan

From 889d615218d5cac9097c11bec33e2d8e70112922 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@heroku.com>
Date: Fri, 5 Sep 2014 16:43:56 -0700
Subject: [PATCH 7/7] Soften wording about nodeAgg.c single column support

Support for forcing nodeAgg.c to use a heap tuplesort rather than a
Datum tuplesort, even when sorting on a single attribute may be useful.
Datum tuplesorts do not and cannot support abbreviated keys, since Datum
tuplesorting necessitates preserving the original representation for
later re-use by certain datum tuplesort clients.  When using a heap
tuplesort will make the difference between using and not using
abbreviated keys, it is likely significantly faster to prefer a heap
tuplesort.  On the other hand, nodeAgg.c's current preference for a
datum tuplesort is likely to result in superior performance for
non-abbreviated cases.

Since it is now apparent that support for forcing a heap tuplesort
within nodeAgg.c will follow in a later, separately reviewed patch,
soften the wording in comments within nodeAgg.c to reflect this.  This
doesn't seem unreasonable, since sortsupport optimization is already
used inconsistently.
---
 src/backend/executor/nodeAgg.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3b335e7..95143c3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -365,9 +365,9 @@ initialize_aggregates(AggState *aggstate,
 			 * otherwise sort the full tuple.  (See comments for
 			 * process_ordered_aggregate_single.)
 			 *
-			 * FIXME: It ought to be useful to force tuplesort_begin_heap()
-			 * case where the abbreviated key optimization can thereby be used,
-			 * even when numInputs == 1.
+			 * In the future, we should consider forcing the
+			 * tuplesort_begin_heap() case when the abbreviated key
+			 * optimization can thereby be used, even when numInputs is 1.
 			 */
 			peraggstate->sortstate =
 				(peraggstate->numInputs == 1) ?
-- 
1.9.1

From d7ac8850819b13eddc1aa34a9ece31ca568fbfa7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@heroku.com>
Date: Fri, 5 Sep 2014 16:00:58 -0700
Subject: [PATCH 6/7] Don't pursue ABBREVIATED_KEYS_TIE optimization

ABBREVIATED_KEYS_TIE was previously exposed by the abbreviated key
infrastructure on the assumption that it was useful to give opclasses
the ability to differentiate between the case when a non-abbreviated
comparison was required due to an abbreviated comparison being
inconclusive, and the case where there was no such leading inconclusive
abbreviated comparison to begin with (e.g. on a comparison of the second
or subsequent attribute in a multi-column sort).

Evidently, at least in the case of text's default opclass (the only
opclass that currently supports abbreviated keys), it is preferable to
always opportunistically attempt to get away with a "memcmp() == 0"
comparison (when the compared text datums are of matching length),
regardless of what degree of confidence we have in a successful outcome.
Testing of non-sympathetic cases has shown no appreciable penalty, while
sympathetic cases, such as multi-column sorts of correlated attributes,
show considerable performance improvements.  Since there is no longer
any distinction made between calls made and not made following an
inconclusive abbreviated comparison, the abstract concept of
ABBREVIATED_KEYS_TIE appears useless, and so has been removed.
---
 src/backend/utils/adt/varlena.c    | 49 +++++++++++++++++++++++++-------------
 src/backend/utils/sort/tuplesort.c |  2 +-
 src/include/utils/sortsupport.h    | 19 ++++-----------
 3 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 92d9135..6aa3eaa 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1884,7 +1884,7 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
 	 * Initialize tiebreak sortsupport state with an authoritative
 	 * strcoll()-based comparison func for tie-breaking.
 	 */
-	ssup->abbrev_tiebreak->abbrev_state = ABBREVIATED_KEYS_TIE;
+	ssup->abbrev_tiebreak->abbrev_state = ABBREVIATED_KEYS_NO;
 	btsortsupport_worker(ssup->abbrev_tiebreak, collid);
 }
 
@@ -1965,26 +1965,41 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
 	len1 = VARSIZE_ANY_EXHDR(arg1);
 	len2 = VARSIZE_ANY_EXHDR(arg2);
 
-	if (ssup->abbrev_state == ABBREVIATED_KEYS_TIE && len1 == len2)
+	if (len1 == len2)
 	{
 		/*
-		 * Being called as authoritative tie-breaker for an abbreviated key
-		 * comparison.
+		 * Regardless of whether this is a call to authoritatively tie-break an
+		 * abbreviated key comparison, or a call with no preceding inconclusive
+		 * abbreviated key comparison (e.g. the call concerns the second or
+		 * subsequent attribute of a multi-column sort, where abbreviated keys
+		 * are never used), opportunistically attempt to resolve the comparison
+		 * with a cheap memcmp() that happens to indicate equality.
 		 *
-		 * TODO:  Consider using this optimization more frequently, perhaps
-		 * even entirely opportunistically.
+		 * With abbreviated keys, in general there is a pretty good chance
+		 * control reached here because the two keys are actually fully equal.
+		 * The ad-hoc costing model within bttext_abbrev_convert() ought to
+		 * ensure that there cannot be an excessive number of these
+		 * opportunistic calls that don't work out;  the model does not
+		 * distinguish between abbreviated comparisons and comparisons
+		 * successfully resolved here, allowing the abbreviated key
+		 * optimization to very effectively accelerate sorts on low cardinality
+		 * attributes.
 		 *
-		 * In general there is a pretty good chance control reached here
-		 * because the two keys are actually fully equal.  Try and give an
-		 * answer using only a cheap memcmp() comparison on the assumption that
-		 * this will indicate equality frequently enough for it to be worth it
-		 * on balance.  This is a reasonable assumption, since sorting is
-		 * almost certainly bottlenecked on memory latency.
-		 *
-		 * Original, authoritative key cardinality is weighed within
-		 * bttext_abbrev_abort().  Cases where attempts at resolving tie-breakers in
-		 * this manner are the usual outcome, and yet those attempts usually
-		 * fail should only arise infrequently.
+		 * Doing this for the non-abbreviated key case tends to help a lot with
+		 * multi-column sorts where columns are highly correlated, such as a
+		 * sort on (country, state, city) attributes -- when a tie-breaker
+		 * comparison on the second or subsequent attributes in sought, a
+		 * considerable proportion of those comparisons can be resolved here.
+		 * However, in the non-abbreviated key case this path is entirely (as
+		 * opposed to somewhat) opportunistic, and in general there is a very
+		 * low chance that this will work out.  Proceeding even in the
+		 * non-abbreviated case is justified by the fact that in practice, the
+		 * cost in non-sympathetic cases is often indistinguishable from zero.
+		 * Presumably, this is due to the effects of CPU pipelining.  The
+		 * bottleneck is very probably memory latency, and so on modern
+		 * hardware there is no appreciable penalty for the often futile
+		 * checking of the first few bytes of each of the pair of strings.
+		 * These bytes already needed to be stored in cachelines.
 		 */
 		if (memcmp(a1p, a2p, len1) == 0)
 		{
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index ea1252c..cabb632 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2946,7 +2946,7 @@ comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
 
 		Assert(attno == sortKey->abbrev_tiebreak->ssup_attno);
 		Assert(sortKey->abbrev_state == ABBREVIATED_KEYS_YES);
-		Assert(sortKey->abbrev_tiebreak->abbrev_state == ABBREVIATED_KEYS_TIE);
+		Assert(sortKey->abbrev_tiebreak->abbrev_state == ABBREVIATED_KEYS_NO);
 
 		datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);
 		datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 7b84e0a..0fa75de 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -56,8 +56,7 @@
 typedef enum
 {
 	ABBREVIATED_KEYS_NO = 0,	/* Second/subsequent key, or no abbreviation capability */
-	ABBREVIATED_KEYS_YES ,		/* Leading (abbreviated applicable) key (when available) */
-	ABBREVIATED_KEYS_TIE,		/* "True" leading key, abbreviation tie-breaker */
+	ABBREVIATED_KEYS_YES		/* Leading (abbreviated applicable) key (when available) */
 } SortKeyAbbreviate;
 
 typedef struct SortSupportData *SortSupport;
@@ -147,22 +146,14 @@ typedef struct SortSupportData
 	 * sortsupport routine needs to know if its dealing with a leading key).
 	 * Even with a leading key, internal sortsupport clients like tuplesort may
 	 * pass ABBREVIATED_KEYS_NO because it isn't feasible to inject the
-	 * conversion routine.  However, ABBREVIATED_KEYS_TIE  means that it's a
-	 * "proper" sortsupport state, originally generated by the sortsupport
-	 * routine itself -- the core system will never directly create a tie-break
-	 * sort support state.  There is very little distinction between
-	 * ABBREVIATED_KEYS_NO and ABBREVIATED_KEYS_TIE, though.  The distinction
-	 * only exists to allow sortsupport routines to squeeze a bit more
-	 * performance from the knowledge that an authoritative tie-breaker
-	 * comparison is required because a prior alternative comparison didn't
-	 * work out (as opposed to being called without there ever being an
-	 * abbreviated comparison of the tuple attribute).
+	 * conversion routine.
 	 *
 	 * A sortsupport routine may itself initially decide against applying the
 	 * abbreviation optimization by setting a passed sort support state's
-	 * abbreviated state to ABBREVIATED_KEYS_NO.  This is typically used for
+	 * abbreviated state to ABBREVIATED_KEYS_NO.  This could be used for
 	 * platform-specific special cases where the optimization is thought to be
-	 * ineffective.
+	 * ineffective.  By convention, opclasses enable and disable abbreviation
+	 * in deference to USE_ABBREV_KEYS.
 	 */
 	SortKeyAbbreviate		abbrev_state;
 
-- 
1.9.1

From a07bd696a0ed933ca95c33f736a3a7e7eccecf0f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@heroku.com>
Date: Tue, 2 Sep 2014 18:55:05 -0700
Subject: [PATCH 5/7] Re-enable optimization on Darwin and 32-bit platforms

Add a macro to pg_config_manual.h that provides a relatively convenient
way to at least disable the optimization at compile time.
---
 src/backend/utils/adt/varlena.c | 39 ++++++++++++++-------------------------
 src/include/pg_config_manual.h  |  8 ++++++++
 2 files changed, 22 insertions(+), 25 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 062035f..92d9135 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1751,24 +1751,13 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
 
 	/*
 	 * On platforms where the abbreviated key for text optimization might have
-	 * bad worst case performance it is avoided entirely.  In order to ever
-	 * apply the optimization we require:
-	 *
-	 * 	* That the platform's strxfrm() meets a certain standard for
-	 * 	representing as much information as possible in leading bytes.
-	 *
-	 *	* That there are a full 8 bytes of storage per Datum on the platform,
-	 *	since we pack bytes into that representation.  Having only 4 bytes
-	 *	could make worse case performance drastically more likely.
-	 *
-	 * The standard applied for strxfrm() is that a significant amount of
-	 * entropy from the original string must be concentrated at the beginning
-	 * of returned blobs.  The Mac OS X implementation is known to produce
-	 * blobs with very little entropy;  it produces printable blobs with a
-	 * series of 24-bit weights, each one encoded into four bytes.  Multiple
-	 * "levels" of weights are separated by a weight of 0.  Typically only two
-	 * ASCII characters are represented in the first eight bytes.  Therefore,
-	 * the optimization is specially disabled.
+	 * bad worst case performance, it may be useful to avoid it entirely by
+	 * disabling it at compile time.  Having only 4 byte datums could make
+	 * worst-case performance drastically more likely, for example.  Moreover,
+	 * Darwin's strxfrm() implementations is known to not effectively
+	 * concentrate a significant amount of entropy from the original string in
+	 * earlier transformed blobs.  It's possible that other supported platforms
+	 * are similarly encumbered.
 	 *
 	 * Any reasonable implementation will pack primary weights into the start
 	 * of returned blobs.  The canonical algorithm's implementation is
@@ -1794,16 +1783,16 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
 	 *
 	 * It is generally expected that most non-equal keys will have their
 	 * comparisons resolved at the primary level.  If enough comparisons can be
-	 * resolved with just 8 byte abbreviated keys, this optimization is very
-	 * effective (although if there are many tie-breakers that largely only
-	 * perform cheap memcmp() calls, that is also much faster than the
+	 * resolved with just 4 or 8 byte abbreviated keys, this optimization is
+	 * very effective (although if there are many tie-breakers that largely
+	 * only perform cheap memcmp() calls, that is also much faster than the
 	 * unoptimized case - see bttext_abbrev_abort()).
 	 *
-	 * There is no reason to not at least perform fmgr elision on platforms
-	 * where abbreviation isn't expected to be profitable, though.
+	 * There is no reason to not at least perform fmgr elision on builds where
+	 * abbreviation is disabled, though.
 	 */
-#if defined(__darwin__) || SIZEOF_DATUM != 8
-	ssup->type = sortKeyOther;
+#ifndef USE_ABBREV_KEYS
+	ssup->abbrev_state = ABBREVIATED_KEYS_NO;
 #endif
 
 	/*
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index a27f5e3..02c43bf 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -304,6 +304,14 @@
 #define TRACE_SORT 1
 
 /*
+ * Enable "abbreviated key" optimization, for sortsupport B-Tree opclasses that
+ * additionally support abbreviated keys.  This is particularly relevant for
+ * sorting text, where platform-specific issues could limit the effectiveness
+ * of the optimization.
+ */
+#define USE_ABBREV_KEYS
+
+/*
  * Enable tracing of syncscan operations (see also the trace_syncscan GUC var).
  */
 /* #define TRACE_SYNCSCAN */
-- 
1.9.1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

Reply via email to