The C collation is treated exactly the same as other collations when
considering whether the generation of abbreviated keys for text should
continue. This doesn't make much sense. With text, the main cost at
risk of being wasted when abbreviated keys fail to capture sufficient
entropy is that of n strxfrm() calls. However, the C collation doesn't
use strxfrm() -- it uses memcmp(), which is far cheaper.
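
To illustrate why conversion is cheap with the C collation, here is a
minimal sketch, not the code in varlena.c. The names abbrev_key,
abbrev_key_c() and abbrev_key_cmp() are hypothetical, and the 8-byte
key size is an assumption made for the example. The point is only that
the key is a zero-padded copy of the leading bytes, so no strxfrm()
call is needed at all:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size abbreviated key (8 bytes is an assumption). */
typedef struct
{
	unsigned char bytes[8];
} abbrev_key;

/*
 * Build a key for the C collation: copy the leading bytes of the
 * string and zero-pad the remainder.  No locale-aware transformation
 * is involved.
 */
static abbrev_key
abbrev_key_c(const char *s, size_t len)
{
	abbrev_key	k;

	assert(s != NULL);
	memset(k.bytes, 0, sizeof(k.bytes));
	memcpy(k.bytes, s, len < sizeof(k.bytes) ? len : sizeof(k.bytes));
	return k;
}

/*
 * Compare two keys bytewise.  A result of 0 means the keys tie, and
 * the caller would have to fall back on a full comparison.
 */
static int
abbrev_key_cmp(abbrev_key a, abbrev_key b)
{
	return memcmp(a.bytes, b.bytes, sizeof(a.bytes));
}
```

Two strings that share their first 8 bytes produce equal keys here,
which is exactly the low-entropy case that the abort logic is meant to
detect.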

With other types, like numeric and now UUID, generating an abbreviated
key costs significantly less than it does for text under collations
other than C. Their cost models reflect this, and abort abbreviation
far less aggressively than text's does, even though the trade-off is
very similar when text uses the C collation.

The attached patch fixes this inconsistency by making it significantly
less likely that abbreviation will be aborted when the C collation is
in use. The behavior with other collations is unchanged. This should
be backpatched to 9.5 as a bugfix, IMV.
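
As a rough illustration of the effect, here is a standalone sketch of
the final test in bttext_abbrev_abort(). The helper names
keep_abbreviating() and initial_prop_card() are hypothetical, and the
sketch leaves out everything else the real function does. It assumes,
as described above, that abbreviation continues whenever the
comparison holds:

```c
#include <assert.h>
#include <stdbool.h>

/* Initial proportion threshold, mirroring the patched assignment. */
static double
initial_prop_card(bool collate_c)
{
	return collate_c ? 0.01 : 0.20;
}

/*
 * Hypothetical stand-in for the cardinality test: abbreviation is
 * kept when the estimated number of distinct abbreviated keys exceeds
 * prop_card times the estimated number of distinct full keys.
 */
static bool
keep_abbreviating(double abbrev_distinct, double key_distinct,
				  double prop_card)
{
	assert(prop_card > 0.0 && prop_card <= 1.0);
	return abbrev_distinct > key_distinct * prop_card;
}
```

With an estimated 1000 distinct full keys and 50 distinct abbreviated
keys, the 0.20 threshold requires more than 200 distinct abbreviated
keys, so abbreviation would be given up. The 0.01 threshold requires
only more than 10, so it would continue.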

-- 
Peter Geoghegan
From c87da5330c636e939983b1ba8eaee581b4c953dd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 29 Nov 2015 12:51:36 -0800
Subject: [PATCH] Abort C collation text abbreviation less frequently

Special-case the C collation by setting a much lower bar for the
amount of entropy that abbreviated keys must capture.  This is
consistent with existing cases that have cheaper conversion processes,
like UUID.

Backpatch to 9.5, where abbreviated keys for text were added.
---
 src/backend/utils/adt/varlena.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index a89f586..0bcdd96 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1869,7 +1869,7 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
 		 */
 		if (abbreviate)
 		{
-			tss->prop_card = 0.20;
+			tss->prop_card = collate_c ? 0.01 : 0.20;
 			initHyperLogLog(&tss->abbr_card, 10);
 			initHyperLogLog(&tss->full_card, 10);
 			ssup->abbrev_full_comparator = ssup->comparator;
@@ -2261,7 +2261,11 @@ bttext_abbrev_abort(int memtupcount, SortSupport ssup)
 	 * cardinality against the overall size of the set in order to more
 	 * accurately model costs.  Assume that an abbreviated comparison, and an
 	 * abbreviated comparison with a cheap memcmp()-based authoritative
-	 * resolution are equivalent.
+	 * resolution are equivalent.  (With the C collation, authoritative
+	 * cardinality is used in the same way, even though the cost of an
+	 * authoritative tie-breaker is no cheaper when values are equal.  The
+	 * theory is that the early appearance of low entropy abbreviated keys
+	 * predicts the same prefix for all or most values.)
 	 */
 	if (abbrev_distinct > key_distinct * tss->prop_card)
 	{
-- 
1.9.1
