Since apparently we're back to development work, I thought it was time
to share a patch implementing a few additional simple tricks to make
sorting text under a non-C locale even faster than in 9.5. These
techniques are mostly effective when values are physically clustered
together. That clustering might arise from a physical/logical
correlation, but any kind of clustering of values helps
significantly.

Caching
=======

The basic idea is that we cache strxfrm() blobs. Separately, we
exploit temporal locality and clustering of values by caching the
result of the most recent strcoll()-resolved comparison performed. The
strxfrm() technique helps a lot with low-cardinality single-attribute
sorts, since there we can avoid most strxfrm() work. On the other hand,
strcoll() comparison caching particularly helps with multi-attribute
sorts that have a low- to moderate-cardinality secondary attribute
and a low-cardinality leading attribute. The master branch will still
opportunistically take the equality memcmp() fastpath plenty of times
for that second attribute, but there are no abbreviated keys to help
when that doesn't work out (because it isn't the leading attribute).
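
To make this concrete, here is a minimal standalone sketch of the
strcoll() result cache. The names are hypothetical and buffer
(re)sizing is omitted -- the attached patch is the real thing:

#include <limits.h>
#include <stdbool.h>
#include <string.h>

typedef struct
{
	char	buf1[8192];		/* copy of last left-hand string */
	char	buf2[8192];		/* copy of last right-hand string */
	int		last_len1;		/* initialize to -1 (invalid) */
	int		last_len2;		/* initialize to -1 (invalid) */
	int		last_returned;	/* initialize to INT_MIN (invalid) */
} cmp_cache;

static int
cached_strcoll(cmp_cache *c, const char *a, int len_a,
			   const char *b, int len_b)
{
	bool	a_match = (len_a == c->last_len1 &&
					   memcmp(c->buf1, a, len_a) == 0);
	bool	b_match = (len_b == c->last_len2 &&
					   memcmp(c->buf2, b, len_b) == 0);

	/* Both operands unchanged: reuse the last strcoll() result */
	if (a_match && b_match && c->last_returned != INT_MIN)
		return c->last_returned;

	if (!a_match)
	{
		memcpy(c->buf1, a, len_a);
		c->buf1[len_a] = '\0';	/* strcoll() needs NUL termination */
		c->last_len1 = len_a;
	}
	if (!b_match)
	{
		memcpy(c->buf2, b, len_b);
		c->buf2[len_b] = '\0';
		c->last_len2 = len_b;
	}
	c->last_returned = strcoll(c->buf1, c->buf2);
	return c->last_returned;
}

The cost when the cache misses is a memcmp() that was going to fail
fast anyway; the payoff when it hits is an entire strcoll() avoided.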

Regressions
===========

The patch still helps with strcoll() avoidance when the ordering of a
moderate cardinality attribute is totally random, but it helps much
less there. I have not seen a regression for any case. I'm expecting
someone to ask me to do something with the program I wrote last year,
to prove the opportunistic memcmp() equality fastpath for text is
"free" [1]. This patch has exactly the same tension as last year's
memcmp() equality one [2]: I add something opportunistic, that in
general might consistently not work out at all in some cases, and on
the face of it implies extra costs for those cases -- costs which must
be paid every single time. So as with the opportunistic memcmp()
equality thing, the *actual* overhead for cases that do not benefit
must be virtually zero for the patch to be worthwhile. That is the
standard that I expect that this patch will be held to, too.

Benchmark
=========

The query that I've been trying this out with is a typical rollup
query, using my "cities" sample data [3] (this is somewhat, although
not perfectly, correlated on (country, province) before sorting):

postgres=# select country, province, count(*) from cities group by
rollup (country, province);
               country               |      province       | count
-------------------------------------+---------------------+--------
 Afghanistan                         | Badaẖšan            |      5
 Afghanistan                         | Bādgīs              |      2
 Afghanistan                         | Baġlān              |      5
 Afghanistan                         | Balẖ                |      6
 Afghanistan                         | Bāmiyān             |      3
 Afghanistan                         | Farāh               |      3
 Afghanistan                         | Fāryāb              |      4
 Afghanistan                         | Ġawr                |      3

*** SNIP ***

 Zimbabwe                            | Manicaland          |     22
 Zimbabwe                            | Mashonaland Central |     13
 Zimbabwe                            | Mashonaland East    |      9
 Zimbabwe                            | Mashonaland West    |     21
 Zimbabwe                            | Masvingo            |     11
 Zimbabwe                            | Matabeleland North  |      8
 Zimbabwe                            | Matabeleland South  |     14
 Zimbabwe                            | Midlands            |     14
 Zimbabwe                            | [null]              |    116
 [null]                              | [null]              | 317102
(3529 rows)

With master, this takes about 525ms when it stabilizes after a few
runs on my laptop. With the patch, it takes about 405ms. That's almost
a 25% reduction in total run time. If I perform a more direct test of
sort performance against this data with minimal non-sorting overhead,
I see a reduction of as much as 30% in total query runtime (I chose
this rollup query because it is obviously representative of the real
world).

If this data is *perfectly* correlated (e.g. because someone ran
CLUSTER) and some sort can use the dubious "bubble sort best case"
path [4] that we added to qsort back in 2006, the improvement still
holds up at ~20%, I've found.

Performance of the "C" locale
-----------------------------

For this particular rollup query, my 25% improvement leaves the
collated text sort perhaps marginally faster than an equivalent query
that uses the "C" locale (with or without the patch applied). It's
hard to be sure that the effect is real -- many trials are needed --
but it's reasonable to speculate that it's possible to sometimes beat
the "C" locale because of factors like final abbreviated key
cardinality.

It's easy to *contrive* a case where the "C" locale is beaten even
with 9.5 -- just sort a bunch of strings (that are abbreviated) that
look something like this:

"``..,,``..AAA"
"``..,,``..CCC"
"``..,,``..ZZZ"
"``..,,``..BBB"

Anyway, the strxfrm() avoidance I've introduced makes it possible
that abbreviated keys could let a strxfrm() locale-based sort beat
the "C" locale fair and square on a realistic dataset and a specific
realistic query. That would be pretty nice, because that can't be too
far from optimal, and these cases are not uncommon.

A further idea -- unsigned integer comparisons
==============================================

I've also changed text abbreviated keys to compare as unsigned
integers. On my Thinkpad laptop (which, of course, has an Intel CPU),
this makes a noticeable difference. memcmp() may be fast, but an
unsigned integer comparison is even faster (I'm not sure whether a
big-endian machine can have the existing memcmp() call optimized away,
so that effectively the same thing happens automatically).

Maybe other platforms benefit less, but it's very hard to imagine it
ever costing us anything.
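
For illustration, a standalone sketch of the scheme -- stand-in Datum
typedef and hypothetical names, and it assumes a GCC/Clang-style
__builtin_bswap64(); the actual patch adds a BSWAP64() macro with a
portable fallback:

#include <stdint.h>
#include <string.h>

typedef uint64_t Datum;			/* stand-in for the real 8-byte Datum */

static Datum
pack_abbrev(const char *blob)	/* first 8 bytes of a strxfrm() blob
								 * (NUL-padded if shorter) */
{
	Datum		d;

	memcpy(&d, blob, sizeof(d));
#ifndef WORDS_BIGENDIAN			/* configure-provided in the real tree */
	d = __builtin_bswap64(d);	/* little-endian: swap so that integer
								 * order matches memcmp() byte order */
#endif
	return d;
}

static int
cmp_abbrev(Datum x, Datum y)
{
	/* branches rather than subtraction: no overflow, no truncation */
	return (x > y) ? 1 : (x == y) ? 0 : -1;
}

The swap is paid once per key at abbreviation time, while the
comparator runs O(n log n) times during the sort, so it's a good
trade.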

[1] http://www.postgresql.org/message-id/5415a843.3070...@vmware.com
[2] Commit e246b3d6
[3] 
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump
[4] Commit a3f0b3d6
-- 
Peter Geoghegan

Attachment: elide_ltrace_stats.txt.gz

From 1528a27f9a8331003a82cd645aabd876882b85ca Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Fri, 3 Jul 2015 13:48:08 -0700
Subject: [PATCH 2/2] Add several text sorting optimizations

Cache the results of strxfrm() calls made by the text abbreviation
routine, and strcoll() calls made by the text comparison routine (text
SortSupport authoritative comparator only).

Testing shows that the cost of doing this is immeasurably small in cases
where the optimization does not help, while the benefits in cases where
the optimization is effective can be quite significant.

Separately, arrange for text abbreviated keys to be represented as
unsigned integers, rather than a char array.  This is somewhat faster on
at least some platforms, since integer comparisons are now used rather
than memcmp() during sorting proper.
---
 src/backend/utils/adt/varlena.c | 118 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 105 insertions(+), 13 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2fbbf54..7a4eed7 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -61,6 +61,9 @@ typedef struct
 	char	   *buf2;			/* 2nd string, or abbreviation strxfrm() buf */
 	int			buflen1;
 	int			buflen2;
+	int			last_len1;		/* Length of last buf1 string/strxfrm() blob */
+	int			last_len2;		/* Length of last buf2 string/strxfrm() blob */
+	int			last_returned;	/* Last comparison result (cache) */
 	bool		collate_c;
 	hyperLogLogState abbr_card; /* Abbreviated key cardinality state */
 	hyperLogLogState full_card; /* Full key cardinality state */
@@ -71,6 +74,23 @@ typedef struct
 } TextSortSupport;
 
 /*
+ * Macro generating new representation of char array packed into Datum, making
+ * it a sizeof(Datum)-wide uint (the C type Datum is an unsigned integer).
+ * Comparisons of this representation should behave equivalently to a memcmp()
+ * of the Datum's entire original contents.  This offers a measurable
+ * performance boost on some systems.
+ */
+#ifdef WORDS_BIGENDIAN
+#define string_uint(x)	(x)
+#else									/* !WORDS_BIGENDIAN */
+#if SIZEOF_DATUM == 8
+#define string_uint(x)	BSWAP64(x)
+#else									/* SIZEOF_DATUM != 8 */
+#define string_uint(x)	BSWAP32(x)
+#endif									/* SIZEOF_DATUM == 8 */
+#endif									/* WORDS_BIGENDIAN */
+
+/*
  * This should be large enough that most strings will fit, but small enough
  * that we feel comfortable putting it on the stack
  */
@@ -1831,9 +1851,20 @@ btsortsupport_worker(SortSupport ssup, Oid collid)
 		tss->buflen1 = TEXTBUFLEN;
 		tss->buf2 = palloc(TEXTBUFLEN);
 		tss->buflen2 = TEXTBUFLEN;
+		/* Start with invalid values */
+		tss->last_len1 = -1;
+		tss->last_len2 = -1;
 #ifdef HAVE_LOCALE_T
 		tss->locale = locale;
 #endif
+		/*
+		 * To avoid somehow confusing a strxfrm() blob and an original string
+		 * within bttextfastcmp_locale(), initialize last returned cache to a
+		 * sentinel value.  A platform-specific actual strcoll() return value
+		 * of INT_MIN seems unlikely, but if that occurs it will not cause
+		 * wrong answers.
+		 */
+		tss->last_returned = INT_MIN;
 		tss->collate_c = collate_c;
 		ssup->ssup_extra = tss;
 
@@ -1896,6 +1927,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
 {
 	text	   *arg1 = DatumGetTextPP(x);
 	text	   *arg2 = DatumGetTextPP(y);
+	bool		arg1_match;
 	TextSortSupport *tss = (TextSortSupport *) ssup->ssup_extra;
 
 	/* working state */
@@ -1914,6 +1946,11 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
 	/* Fast pre-check for equality, as discussed in varstr_cmp() */
 	if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
 	{
+		/*
+		 * No change in buf1 or buf2 contents, so avoid changing last_len1 or
+		 * last_len2.  Existing contents of buffers might still be used by next
+		 * call.
+		 */
 		result = 0;
 		goto done;
 	}
@@ -1931,10 +1968,50 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
 		tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
 	}
 
-	memcpy(tss->buf1, a1p, len1);
-	tss->buf1[len1] = '\0';
-	memcpy(tss->buf2, a2p, len2);
-	tss->buf2[len2] = '\0';
+	/*
+	 * Attempt to re-use buffers across calls.  Also, avoid work in the event
+	 * of clustered together identical items by exploiting temporal locality.
+	 * This works well with divide-and-conquer, comparison-based sorts like
+	 * quicksort and mergesort.
+	 *
+	 * With quicksort, there is, in general, a pretty strong chance that the
+	 * same buffer contents can be used repeatedly for pivot items -- early
+	 * pivot items will account for a large number of total comparisons, since
+	 * they must be compared against many (possibly all other) items.
+	 */
+	arg1_match = true;
+	if (len1 != tss->last_len1 || memcmp(tss->buf1, a1p, len1) != 0)
+	{
+		arg1_match = false;
+		memcpy(tss->buf1, a1p, len1);
+		tss->buf1[len1] = '\0';
+		tss->last_len1 = len1;
+	}
+
+	/*
+	 * While it is worth going to the trouble of trying to re-use buffer
+	 * contents across calls, ideally that will lead to entirely avoiding a
+	 * strcoll() call by using a cached return value.
+	 *
+	 * This optimization can work well again and again for the same set of
+	 * clustered together identical attributes;  as they're relocated to new
+	 * subpartitions, only one strcoll() is required for each pivot (in respect
+	 * of that clump of identical values, at least).  Similarly, the final
+	 * N-way merge of a mergesort can be effectively accelerated if each run
+	 * has its own locally clustered values.
+	 */
+	if (len2 != tss->last_len2 || memcmp(tss->buf2, a2p, len2) != 0)
+	{
+		memcpy(tss->buf2, a2p, len2);
+		tss->buf2[len2] = '\0';
+		tss->last_len2 = len2;
+	}
+	else if (arg1_match && tss->last_returned != INT_MIN)
+	{
+		/* Use result cached following last actual strcoll() call */
+		result = tss->last_returned;
+		goto done;
+	}
 
 #ifdef HAVE_LOCALE_T
 	if (tss->locale)
@@ -1951,6 +2028,8 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
 	if (result == 0)
 		result = strcmp(tss->buf1, tss->buf2);
 
+	/* Cache result value, perhaps saving a costly strcoll() in next call */
+	tss->last_returned = result;
 done:
 	/* We can't afford to leak memory here. */
 	if (PointerGetDatum(arg1) != x)
@@ -1967,25 +2046,24 @@ done:
 static int
 bttextcmp_abbrev(Datum x, Datum y, SortSupport ssup)
 {
-	char	   *a = (char *) &x;
-	char	   *b = (char *) &y;
-	int			result;
-
-	result = memcmp(a, b, sizeof(Datum));
-
 	/*
-	 * When result = 0, the core system will call bttextfastcmp_c() or
+	 * When 0 is returned, the core system will call bttextfastcmp_c() or
 	 * bttextfastcmp_locale().  Even a strcmp() on two non-truncated strxfrm()
 	 * blobs cannot indicate *equality* authoritatively, for the same reason
 	 * that there is a strcoll() tie-breaker call to strcmp() in varstr_cmp().
 	 */
-	return result;
+	if (x > y)
+		return 1;
+	else if (x == y)
+		return 0;
+	else
+		return -1;
 }
 
 /*
  * Conversion routine for sortsupport.  Converts original text to abbreviated
  * key representation.  Our encoding strategy is simple -- pack the first 8
- * bytes of a strxfrm() blob into a Datum.
+ * bytes of a strxfrm() blob into a Datum, and treat it as an unsigned integer.
  */
 static Datum
 bttext_abbrev_convert(Datum original, SortSupport ssup)
@@ -2033,9 +2111,19 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 			tss->buf1 = palloc(tss->buflen1);
 		}
 
+		/* Might be able to reuse strxfrm() blob from last call */
+		if (tss->last_len1 == len &&
+			memcmp(tss->buf1, authoritative_data, len) == 0)
+		{
+			memcpy(pres, tss->buf2, Min(sizeof(Datum), tss->last_len2));
+			/* No change affecting cardinality, so no hashing required */
+			goto done;
+		}
+
 		/* Just like strcoll(), strxfrm() expects a NUL-terminated string */
 		memcpy(tss->buf1, authoritative_data, len);
 		tss->buf1[len] = '\0';
+		tss->last_len1 = len;
 
 		for (;;)
 		{
@@ -2047,6 +2135,7 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 #endif
 				bsize = strxfrm(tss->buf2, tss->buf1, tss->buflen2);
 
+			tss->last_len2 = bsize;
 			if (bsize < tss->buflen2)
 				break;
 
@@ -2104,6 +2193,9 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 
 	addHyperLogLog(&tss->abbr_card, hash);
 
+done:
+	/* Represent Datum packed with string as uint */
+	res = string_uint(res);
 	/* Don't leak memory here */
 	if (PointerGetDatum(authoritative) != original)
 		pfree(authoritative);
-- 
1.9.1

From 10ad5a5975306e2ed9ba7f98a116536d606444b3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Mon, 29 Jun 2015 23:53:05 -0400
Subject: [PATCH 1/2] Add BSWAP64() byte-swapping macro

This includes detection of __builtin_bswap64() built-in availability,
which is preferred.  BSWAP64() can be used to convert big-endian
unsigned integers to little-endian unsigned integers, and vice-versa.

BSWAP32() is moved to c.h, alongside the new BSWAP64() macro.
---
 config/c-compiler.m4          | 18 ++++++++++++++++++
 configure                     | 24 ++++++++++++++++++++++++
 configure.in                  |  1 +
 src/include/c.h               | 23 +++++++++++++++++++++++
 src/include/pg_config.h.in    |  3 +++
 src/include/pg_config.h.win32 |  3 +++
 src/include/port/pg_crc32c.h  | 10 ----------
 7 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 050bfa5..f1b7045 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -248,6 +248,24 @@ fi])# PGAC_C_BUILTIN_BSWAP32
 
 
 
+# PGAC_C_BUILTIN_BSWAP64
+# -------------------------
+# Check if the C compiler understands __builtin_bswap64(),
+# and define HAVE__BUILTIN_BSWAP64 if so.
+AC_DEFUN([PGAC_C_BUILTIN_BSWAP64],
+[AC_CACHE_CHECK(for __builtin_bswap64, pgac_cv__builtin_bswap64,
+[AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+[static unsigned long int x = __builtin_bswap64(0xaabbccdd);]
+)],
+[pgac_cv__builtin_bswap64=yes],
+[pgac_cv__builtin_bswap64=no])])
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_BSWAP64, 1,
+          [Define to 1 if your compiler understands __builtin_bswap64.])
+fi])# PGAC_C_BUILTIN_BSWAP64
+
+
+
 # PGAC_C_BUILTIN_CONSTANT_P
 # -------------------------
 # Check if the C compiler understands __builtin_constant_p(),
diff --git a/configure b/configure
index 0407c4f..b921d63 100755
--- a/configure
+++ b/configure
@@ -10481,6 +10481,30 @@ if test x"$pgac_cv__builtin_bswap32" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_BSWAP32 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_bswap64" >&5
+$as_echo_n "checking for __builtin_bswap64... " >&6; }
+if ${pgac_cv__builtin_bswap64+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+static unsigned long int x = __builtin_bswap64(0xaabbccdd);
+
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_bswap64=yes
+else
+  pgac_cv__builtin_bswap64=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_bswap64" >&5
+$as_echo "$pgac_cv__builtin_bswap64" >&6; }
+if test x"$pgac_cv__builtin_bswap64" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_BSWAP64 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_constant_p" >&5
 $as_echo_n "checking for __builtin_constant_p... " >&6; }
 if ${pgac_cv__builtin_constant_p+:} false; then :
diff --git a/configure.in b/configure.in
index 1de41a2..77cb477 100644
--- a/configure.in
+++ b/configure.in
@@ -1237,6 +1237,7 @@ PGAC_C_FUNCNAME_SUPPORT
 PGAC_C_STATIC_ASSERT
 PGAC_C_TYPES_COMPATIBLE
 PGAC_C_BUILTIN_BSWAP32
+PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
 PGAC_C_VA_ARGS
diff --git a/src/include/c.h b/src/include/c.h
index 92c5202..9874e39 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -963,6 +963,29 @@ typedef NameData *Name;
 #define STATUS_FOUND			(1)
 #define STATUS_WAITING			(2)
 
+/* byte swapping macros */
+#ifdef HAVE__BUILTIN_BSWAP32
+#define BSWAP32(x) __builtin_bswap32(x)
+#else
+#define BSWAP32(x) (((x << 24) & 0xff000000) | \
+					((x << 8) & 0x00ff0000) | \
+					((x >> 8) & 0x0000ff00) | \
+					((x >> 24) & 0x000000ff))
+#endif	/* HAVE__BUILTIN_BSWAP32 */
+
+#ifdef HAVE__BUILTIN_BSWAP64
+#define BSWAP64(x) __builtin_bswap64(x)
+#else
+#define BSWAP64(x) (StaticAssertExpr(sizeof(x) == 8, "argument not 8 bytes"), \
+		( (x << 56) & 0xff00000000000000UL ) | \
+		( (x << 40) & 0x00ff000000000000UL ) | \
+		( (x << 24) & 0x0000ff0000000000UL ) | \
+		( (x <<  8) & 0x000000ff00000000UL ) | \
+		( (x >>  8) & 0x00000000ff000000UL ) | \
+		( (x >> 24) & 0x0000000000ff0000UL ) | \
+		( (x >> 40) & 0x000000000000ff00UL ) | \
+		( (x >> 56) & 0x00000000000000ffUL ))
+#endif	/* HAVE__BUILTIN_BSWAP64 */
 
 /*
  * Append PG_USED_FOR_ASSERTS_ONLY to definitions of variables that are only
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5688f75..90c0e5f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -666,6 +666,9 @@
 /* Define to 1 if your compiler understands __builtin_bswap32. */
 #undef HAVE__BUILTIN_BSWAP32
 
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+#undef HAVE__BUILTIN_BSWAP64
+
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 22bbb91..4d484d1 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -520,6 +520,9 @@
 /* Define to 1 if your compiler understands __builtin_bswap32. */
 /* #undef HAVE__BUILTIN_BSWAP32 */
 
+/* Define to 1 if your compiler understands __builtin_bswap64. */
+/* #undef HAVE__BUILTIN_BSWAP64 */
+
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
diff --git a/src/include/port/pg_crc32c.h b/src/include/port/pg_crc32c.h
index c925c56..e5389aa 100644
--- a/src/include/port/pg_crc32c.h
+++ b/src/include/port/pg_crc32c.h
@@ -71,16 +71,6 @@ extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len)
 #define COMP_CRC32C(crc, data, len) \
 	((crc) = pg_comp_crc32c_sb8((crc), (data), (len)))
 #ifdef WORDS_BIGENDIAN
-
-#ifdef HAVE__BUILTIN_BSWAP32
-#define BSWAP32(x) __builtin_bswap32(x)
-#else
-#define BSWAP32(x) (((x << 24) & 0xff000000) | \
-					((x << 8) & 0x00ff0000) | \
-					((x >> 8) & 0x0000ff00) | \
-					((x >> 24) & 0x000000ff))
-#endif
-
 #define FIN_CRC32C(crc) ((crc) = BSWAP32(crc) ^ 0xFFFFFFFF)
 #else
 #define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
-- 
1.9.1
