Re: Refactor to introduce pg_strcoll().

Jeff Davis Wed, 09 Nov 2022 18:24:09 -0800

Attached some new patches. I think you were right that the API of
pg_strcoll() was strange before, so I changed it to:

  * pg_strcoll() takes nul-terminated arguments
  * pg_strncoll() takes arguments and lengths

The new pg_strcoll()/pg_strncoll() api in 0001 seems much reasonable to
support in the long term, potentially with other callers.

Patch 0004 exports:

  * pg_collate_icu() is for ICU and takes arguments and lengths
  * pg_collate_libc() is for libc and takes nul-terminated arguments

for a tiny speedup because varstrfastcmp_locale() has both nul-
terminated arguments and their lengths.

On Fri, 2022-11-04 at 15:06 -0700, Jeff Davis wrote:
> But I think the complex hacks are just the transformation into UTF 16
> and calling of wcscoll(). If that's done from within pg_strcoll()
> (after my patch), then I think it just works, right?

I included a patch (0005) to enable varstrfastcmp_locale on windows
with a server encoding of UTF-8. I don't have a great way of testing
this, but it seems like it should work.

> I was also considering an optimization to use stack allocation for
> short strings when doing the non-UTF8 icu comparison. I am seeing
> some
> benefit there

I also included this optimization in 0003: try to use the stack for
reasonable values; allocate on the heap for larger strings. I think
it's helping a bit.

Patch 0002 helps a lot more: for the case of ICU with a non-UTF8 server
encoding, the speedup is something like 30%. The reason is that
icu_to_uchar() currently calls ucnv_toUChars() twice: once to determine
the precise output buffer size required, and then again once it has the
buffer ready. But it's easy to just size the destination buffer
conservatively, because the maximum space required is enough to hold
twice the number of UChars as there are input characters[1], plus the
terminating nul. That means we just call ucnv_toUChars() once, to do
the actual conversion.

My perf test was to use a quick hack to disable abbreviated keys (so
that it's doing more comparisons), and then just do a large ORDER BY
... COLLATE on a table with generated text. The text is random lorem
ipsum data, with some characters replaced by multibyte characters to
exercise some more interesting paths. If someone else has some more
sophisticated collation tests I'd be interested to try them.

Regards,
        Jeff Davis

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucnv_8h.html#ae1049fcb893783c860fe0f9d4da84939

From 29249d17b5580e87365c008b18c83db5528db37e Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 9 Nov 2022 08:58:08 -0800
Subject: [PATCH v3 1/5] Introduce pg_strcoll() and pg_strncoll().

Hide the special cases of the platform, collation provider, and
database encoding in pg_locale.c. Simplify varlena.c.
---
 src/backend/utils/adt/pg_locale.c | 263 ++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c   | 230 +-------------------------
 src/include/utils/pg_locale.h     |   3 +
 3 files changed, 268 insertions(+), 228 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..94dc64c2d0 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -1731,6 +1737,263 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * win32_utf8_wcscoll
+ *
+ * Convert UTF8 arguments to wide characters and invoke wcscoll() or
+ * wcscoll_l().
+ */
+#ifdef WIN32
+static int
+win32_utf8_wcscoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				   pg_locale_t locale)
+{
+	char		a1buf[TEXTBUFLEN];
+	char		a2buf[TEXTBUFLEN];
+	char	   *a1p,
+			   *a2p;
+	int			a1len;
+	int			a2len;
+	int			r;
+	int			result;
+
+	if (len1 >= TEXTBUFLEN / 2)
+	{
+		a1len = len1 * 2 + 2;
+		a1p = palloc(a1len);
+	}
+	else
+	{
+		a1len = TEXTBUFLEN;
+		a1p = a1buf;
+	}
+	if (len2 >= TEXTBUFLEN / 2)
+	{
+		a2len = len2 * 2 + 2;
+		a2p = palloc(a2len);
+	}
+	else
+	{
+		a2len = TEXTBUFLEN;
+		a2p = a2buf;
+	}
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (a1p != a1buf)
+		pfree(a1p);
+	if (a2p != a2buf)
+		pfree(a2p);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * Collate using the libc provider. Arguments must be nul-terminated.
+ */
+static int
+pg_collate_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+#ifdef WIN32
+	/* Win32 does not have UTF-8, so we need to map to UTF-16 */
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = win32_utf8_wcscoll(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * Collate using the icu provider.
+ */
+static int
+pg_collate_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			   pg_locale_t locale)
+{
+#ifdef USE_ICU
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		int32_t		ulen1,
+					ulen2;
+		UChar	   *uchar1,
+				   *uchar2;
+
+		ulen1 = icu_to_uchar(&uchar1, arg1, len1);
+		ulen2 = icu_to_uchar(&uchar2, arg2, len2);
+
+		result = ucol_strcoll(locale->info.icu.ucol,
+							  uchar1, ulen1,
+							  uchar2, ulen2);
+
+		pfree(uchar1);
+		pfree(uchar2);
+	}
+
+	return result;
+#else							/* not USE_ICU */
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif							/* not USE_ICU */
+}
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * If the collation is deterministic, break ties with strcmp().
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (locale && locale->provider == COLLPROVIDER_ICU)
+	{
+		size_t		len1 = strlen(arg1);
+		size_t		len2 = strlen(arg2);
+		result = pg_collate_icu(arg1, len1, arg2, len2, locale);
+	}
+	else
+	{
+		result = pg_collate_libc(arg1, arg2, locale);
+	}
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+		result = strcmp(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * If the collation is deterministic, break ties with memcmp(), and then with
+ * the string length.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (locale && locale->provider == COLLPROVIDER_ICU)
+	{
+		result = pg_collate_icu(arg1, len1, arg2, len2, locale);
+	}
+	else
+	{
+		char	*arg1n = palloc(len1 + 1);
+		char	*arg2n = palloc(len2 + 1);
+
+		/* nul-terminate arguments */
+		memcpy(arg1n, arg1, len1);
+		arg1n[len1] = '\0';
+		memcpy(arg2n, arg2, len2);
+		arg2n[len2] = '\0';
+
+		result = pg_collate_libc(arg1n, arg2n, locale);
+
+		pfree(arg1n);
+		pfree(arg2n);
+	}
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+	{
+		result = memcmp(arg1, arg2, Min(len1, len2));
+		if ((result == 0) && (len1 != len2))
+			result = (len1 < len2) ? -1 : 1;
+	}
+
+	return result;
+}
 
 #ifdef USE_ICU
 /*
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c5e7ee7ca2..c904bc0825 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1535,10 +1535,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1555,171 +1551,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
-
-		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 	}
 
 	return result;
@@ -2377,65 +2209,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
-
-	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
-		result = strcmp(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
 	sss->cache_blob = false;
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..bf70ae08ca 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -100,6 +100,9 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1

From dd14c838acc65aa90a6456222d829ae1986f1c41 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Tue, 8 Nov 2022 16:46:27 -0800
Subject: [PATCH v3 2/5] Refactor icu_to_uchar().

Separate buffer allocation and conversion so that the caller can
supply its own buffer. This also enables the caller to size the
destination to the maximum requirement of twice the number of input
characters, rather than go through ucnv_toUChars() to compute the
precise destination buffer size requirement (which is more expensive).
---
 src/backend/utils/adt/pg_locale.c | 128 +++++++++++++++++++++---------
 1 file changed, 92 insertions(+), 36 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 94dc64c2d0..17857067a3 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -129,6 +129,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1859,6 +1872,47 @@ pg_collate_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	return result;
 }
 
+/*
+ * Collate using the icu provider when the database encoding is not UTF-8.
+ */
+#ifdef USE_ICU
+static int
+pg_collate_icu_no_utf8(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale)
+{
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	uchar1 = palloc(bufsize1);
+	uchar2 = palloc(bufsize2);
+
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	pfree(uchar1);
+	pfree(uchar2);
+
+	return result;
+}
+#endif
+
 /*
  * Collate using the icu provider.
  */
@@ -1888,20 +1942,7 @@ pg_collate_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 	else
 #endif
 	{
-		int32_t		ulen1,
-					ulen2;
-		UChar	   *uchar1,
-				   *uchar2;
-
-		ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-		ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-		result = ucol_strcoll(locale->info.icu.ucol,
-							  uchar1, ulen1,
-							  uchar2, ulen2);
-
-		pfree(uchar1);
-		pfree(uchar2);
+		result = pg_collate_icu_no_utf8(arg1, len1, arg2, len2, locale);
 	}
 
 	return result;
@@ -1996,13 +2037,6 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 }
 
 #ifdef USE_ICU
-/*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
- */
-static UConverter *icu_converter = NULL;
-
 static void
 init_icu_converter(void)
 {
@@ -2030,6 +2064,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -2045,26 +2112,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
-- 
2.34.1

From 3f6cde0dea813ad45d05bd7655a66bf17a88d58b Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 9 Nov 2022 12:26:38 -0800
Subject: [PATCH v3 3/5] Optimize buffer allocation for
 pg_strcoll()/pg_strncoll().

When it's necessary to convert the arguments before comparison
(e.g. to a UChar string), use stack space if the arguments are small
enough to fit; otherwise fall back to a heap allocation.
---
 src/backend/utils/adt/pg_locale.c | 70 +++++++++++++++----------------
 1 file changed, 34 insertions(+), 36 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 17857067a3..7c29519214 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1761,35 +1761,20 @@ static int
 win32_utf8_wcscoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 				   pg_locale_t locale)
 {
-	char		a1buf[TEXTBUFLEN];
-	char		a2buf[TEXTBUFLEN];
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
 	char	   *a1p,
 			   *a2p;
-	int			a1len;
-	int			a2len;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
 	int			r;
 	int			result;
 
-	if (len1 >= TEXTBUFLEN / 2)
-	{
-		a1len = len1 * 2 + 2;
-		a1p = palloc(a1len);
-	}
-	else
-	{
-		a1len = TEXTBUFLEN;
-		a1p = a1buf;
-	}
-	if (len2 >= TEXTBUFLEN / 2)
-	{
-		a2len = len2 * 2 + 2;
-		a2p = palloc(a2len);
-	}
-	else
-	{
-		a2len = TEXTBUFLEN;
-		a2p = a2buf;
-	}
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
 
 	/* API does not work for zero-length input */
 	if (len1 == 0)
@@ -1830,10 +1815,8 @@ win32_utf8_wcscoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		ereport(ERROR,
 				(errmsg("could not compare Unicode strings: %m")));
 
-	if (a1p != a1buf)
-		pfree(a1p);
-	if (a2p != a2buf)
-		pfree(a2p);
+	if (buf != sbuf)
+		pfree(buf);
 
 	return result;
 }
@@ -1880,6 +1863,8 @@ static int
 pg_collate_icu_no_utf8(const char *arg1, size_t len1,
 					   const char *arg2, size_t len2, pg_locale_t locale)
 {
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
 	int32_t	 ulen1;
 	int32_t	 ulen2;
 	size_t   bufsize1;
@@ -1896,8 +1881,11 @@ pg_collate_icu_no_utf8(const char *arg1, size_t len1,
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
 
-	uchar1 = palloc(bufsize1);
-	uchar2 = palloc(bufsize2);
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
 
 	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
 	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
@@ -1906,8 +1894,8 @@ pg_collate_icu_no_utf8(const char *arg1, size_t len1,
 						  uchar1, ulen1,
 						  uchar2, ulen2);
 
-	pfree(uchar1);
-	pfree(uchar2);
+	if (buf != sbuf)
+		pfree(buf);
 
 	return result;
 }
@@ -2010,8 +1998,18 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 	}
 	else
 	{
-		char	*arg1n = palloc(len1 + 1);
-		char	*arg2n = palloc(len2 + 1);
+		char	 sbuf[TEXTBUFLEN];
+		char	*buf = sbuf;
+		size_t	 bufsize1 = len1 + 1;
+		size_t	 bufsize2 = len2 + 1;
+		char	*arg1n;
+		char	*arg2n;
+
+		if (bufsize1 + bufsize2 > TEXTBUFLEN)
+			buf = palloc(bufsize1 + bufsize2);
+
+		arg1n = buf;
+		arg2n = buf + bufsize1;
 
 		/* nul-terminate arguments */
 		memcpy(arg1n, arg1, len1);
@@ -2021,8 +2019,8 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 
 		result = pg_collate_libc(arg1n, arg2n, locale);
 
-		pfree(arg1n);
-		pfree(arg2n);
+		if (buf != sbuf)
+			pfree(buf);
 	}
 
 	/* Break tie if necessary. */
-- 
2.34.1

From 8aea65b0a32a0bee9432ccd537d2852e4cc74087 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 9 Nov 2022 17:00:01 -0800
Subject: [PATCH v3 4/5] Export pg_collate_libc() and pg_collate_icu().

Minor optimization for callers such as varstrfastcmp_locale(), which
has both nul-terminated arguments, and also the string length.
---
 src/backend/utils/adt/pg_locale.c | 40 ++++++++++++++++---------------
 src/backend/utils/adt/varlena.c   |  6 ++++-
 src/include/utils/pg_locale.h     |  4 ++++
 3 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 7c29519214..df809891ac 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1824,8 +1824,10 @@ win32_utf8_wcscoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 
 /*
  * Collate using the libc provider. Arguments must be nul-terminated.
+ *
+ * If the collation is deterministic, break ties with strcmp().
  */
-static int
+int
 pg_collate_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 {
 	int result;
@@ -1852,6 +1854,10 @@ pg_collate_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	else
 		result = strcoll(arg1, arg2);
 
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+		result = strcmp(arg1, arg2);
+
 	return result;
 }
 
@@ -1865,8 +1871,8 @@ pg_collate_icu_no_utf8(const char *arg1, size_t len1,
 {
 	char	 sbuf[TEXTBUFLEN];
 	char	*buf = sbuf;
-	int32_t	 ulen1;
-	int32_t	 ulen2;
+	int32_t	 ulen1 = len1 * 2;
+	int32_t	 ulen2 = len2 * 2;
 	size_t   bufsize1;
 	size_t   bufsize2;
 	UChar	*uchar1,
@@ -1875,9 +1881,6 @@ pg_collate_icu_no_utf8(const char *arg1, size_t len1,
 
 	init_icu_converter();
 
-	ulen1 = uchar_length(icu_converter, arg1, len1);
-	ulen2 = uchar_length(icu_converter, arg2, len2);
-
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
 
@@ -1903,8 +1906,11 @@ pg_collate_icu_no_utf8(const char *arg1, size_t len1,
 
 /*
  * Collate using the icu provider.
+ *
+ * If the collation is deterministic, break ties with memcmp(), and then with
+ * the string length.
  */
-static int
+int
 pg_collate_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 			   pg_locale_t locale)
 {
@@ -1933,6 +1939,14 @@ pg_collate_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		result = pg_collate_icu_no_utf8(arg1, len1, arg2, len2, locale);
 	}
 
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+	{
+		result = memcmp(arg1, arg2, Min(len1, len2));
+		if ((result == 0) && (len1 != len2))
+			result = (len1 < len2) ? -1 : 1;
+	}
+
 	return result;
 #else							/* not USE_ICU */
 	/* shouldn't happen */
@@ -1967,10 +1981,6 @@ pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
 		result = pg_collate_libc(arg1, arg2, locale);
 	}
 
-	/* Break tie if necessary. */
-	if (result == 0 && (!locale || locale->deterministic))
-		result = strcmp(arg1, arg2);
-
 	return result;
 }
 
@@ -2023,14 +2033,6 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 			pfree(buf);
 	}
 
-	/* Break tie if necessary. */
-	if (result == 0 && (!locale || locale->deterministic))
-	{
-		result = memcmp(arg1, arg2, Min(len1, len2));
-		if ((result == 0) && (len1 != len2))
-			result = (len1 < len2) ? -1 : 1;
-	}
-
 	return result;
 }
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c904bc0825..5ebfe3acb7 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2209,7 +2209,11 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
+	if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+		result = pg_collate_icu(sss->buf1, len1, sss->buf2, len2,
+								sss->locale);
+	else
+		result = pg_collate_libc(sss->buf1, sss->buf2, sss->locale);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
 	sss->cache_blob = false;
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index bf70ae08ca..5f7a8ea435 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -103,6 +103,10 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
 extern int pg_strncoll(const char *arg1, size_t len1,
 					   const char *arg2, size_t len2, pg_locale_t locale);
+extern int pg_collate_libc(const char *arg1, const char *arg2,
+						   pg_locale_t locale);
+extern int pg_collate_icu(const char *arg1, size_t len1,
+						  const char *arg2, size_t len2, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1

From 211dfc48f879b70fe18326f42bcd29ccefdf2c03 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 9 Nov 2022 13:56:19 -0800
Subject: [PATCH v3 5/5] Allow windows to use varlenafastcmp_locale.

Now that the combination of windows with the libc provider and UTF-8
for the database encoding is handled by pg_collate_libc(), we can
support varlenafastcmp_locale for windows as well.
---
 src/backend/utils/adt/varlena.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 5ebfe3acb7..e23c2ca31e 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1887,20 +1887,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
-- 
2.34.1

Re: Refactor to introduce pg_strcoll().

Reply via email to