The C collation is treated exactly the same as other collations when considering whether the generation of abbreviated keys for text should continue. This doesn't make much sense. With text, the big cost that we are concerned about going to waste should abbreviated keys not capture sufficient entropy is the cost of n strxfrm() calls. However, the C collation doesn't use strxfrm() -- it uses memcmp(), which is far cheaper.
With other types, like numeric and now UUID, the cost of generating an abbreviated key is significantly lower than text when using collations other than the C collation. Their cost models reflect this, and abort abbreviation far less aggressively than text's, even though the trade-off is very similar when text uses the C collation. Attached patch fixes this inconsistency by making it significantly less likely that abbreviation will be aborted when the C collation is in use. The behavior with other collations is unchanged. This should be backpatched to 9.5 as a bugfix, IMV. -- Peter Geoghegan
From c87da5330c636e939983b1ba8eaee581b4c953dd Mon Sep 17 00:00:00 2001 From: Peter Geoghegan <peter.geoghega...@gmail.com> Date: Sun, 29 Nov 2015 12:51:36 -0800 Subject: [PATCH] Abort C collation text abbreviation less frequently Discriminate against the C collation by creating a much lower bar for the amount of entropy that abbreviated keys must capture. This is consistent with existing cases that have cheaper conversion processes, like UUID. Backpatch to 9.5, where abbreviated keys for text were added. --- src/backend/utils/adt/varlena.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index a89f586..0bcdd96 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -1869,7 +1869,7 @@ btsortsupport_worker(SortSupport ssup, Oid collid) */ if (abbreviate) { - tss->prop_card = 0.20; + tss->prop_card = collate_c ? 0.01 : 0.20; initHyperLogLog(&tss->abbr_card, 10); initHyperLogLog(&tss->full_card, 10); ssup->abbrev_full_comparator = ssup->comparator; @@ -2261,7 +2261,11 @@ bttext_abbrev_abort(int memtupcount, SortSupport ssup) * cardinality against the overall size of the set in order to more * accurately model costs. Assume that an abbreviated comparison, and an * abbreviated comparison with a cheap memcmp()-based authoritative - * resolution are equivalent. + * resolution are equivalent. (With the C collation, authoritative + * cardinality is used in the same way, even though the cost of an + * authoritative tie-breaker is no cheaper when values are equal. The + * theory is that the early appearance of low entropy abbreviated keys + * predicts the same prefix for all or most values.) */ if (abbrev_distinct > key_distinct * tss->prop_card) { -- 1.9.1
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers