[HACKERS] Patch to address concerns about ICU collcollate stability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)

Peter Geoghegan Sun, 24 Sep 2017 21:26:23 -0700

On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> After reviewing this thread and testing around a bit, I think the code
> is mostly fine as it is, but the documentation is lacking.  Hence,
> attached is a patch to expand the documentation quite a bit, especially
> to document in more detail what ICU locale strings are accepted.


Attached is my counterproposal. This patch has us always make sure
that collcollate is in BCP 47 format, even on ICU versions prior to
ICU 54.

Other improvements/bug fixes:

* Sanitizes locale names within CREATE COLLATION. This has been tested
on ICU 42 (earliest supported version) and ICU 55.

* Enforces a NAMEDATALEN restriction for collcollate during CREATE
COLLATION, forbidding collcollate truncation. This is useful because
truncating can allow subtly wrong answers later on.

* Adds DEBUG1 message with ICU display name, so we can satisfy
ourselves that we're getting the expected behavior, to some degree.

I used this to confirm that we get consistent behavior between ICU 42
and 55 for CREATE COLLATION. On ICU 42, keyword collation attributes
(e.g., the emoji keyword, numeric ordering for natural sort order)
still don't work, just as before, but the locale string is still
considered valid. (This is because ucol_setAttribute() is supposed to
be used there).

* Documents the aforementioned keyword collation attribute restriction
on ICU versions before ICU 54. This was needed anyway. We only claim
for Postgres collations what the ICU docs claim for ICU collators,
even though there is reason to believe that some ICU versions before
ICU 54 actually can do better.

* When using ICU 4.2, the examples in the docs (variant collations
like German Phonebook order) now actually work.

The examples are completely broken right now IMV, because the user has
to know that they are on ICU 4.2, which only accepts the old style
locale strings as things stand. And, they'd have no obvious indication
that things were broken without this patch, because there would have
been no sanitization or other feedback.

* Creates root collation as root-x-icu (collcollate "root"), not
und-x-icu. "und" means undefined language.

* Moves all knowledge of ICU version issues to within a few
pg_locale.c routines, leaving the code reasonably well encapsulated.

* Does an encoding conversion when getting a display name for the
initdb collation comment. This needs to be ascii-safe due to the
initdb/template0 DB encoding restriction, but I suspect that the way
we do it now is subtly wrong. This does imply that SQL_ASCII databases
will never get ICU pg_collation entries that come with comments added
by initdb, but such databases were already unable to use the
collations anyway, so this is no loss at all. (SQL_ASCII is mentioned
here as an example of a database collation that ICU doesn't support,
all of which are affected in this way.)

I decided to implement CREATE COLLATION sanitization based on whether
or not a display string can be generated (if not, or if it's empty, we
reject). This seems like about the right level of strictness to me,
because we're still very forgiving. I admit that that's a little bit
arbitrary, but it does seem to be a good match for Postgres; it is
forgiving of things that could plausibly make sense on another ICU
version to some user at some time, but rejects most things that are
inherently wrong, IMHO. You can still ask for Japanese as spoken in
Madagascar, or even specify a language that ICU has never heard of,
and there is no error. It catches syntax errors only. See the slightly
expanded tests for details. I'm very open to negotiating the exact
details of how we sanitize, but any level of sanitization will be at
least a little bit arbitrary (including what we have right now, which
is no sanitization).

Aside from the specific problems for Postgres that I've noted that the
patch prevents or fixes, there is another reason to do this. The old
style locale name format is officially deprecated by ICU, which makes
it seem like we should never expose it to users in the first place.
Per ICU docs:

"Starting with ICU 54, the following naming scheme and its API
functions are deprecated. Use ucol_open() with language tag collation
keywords instead (see Collation API Details)" [1]

[1] 
http://userguide.icu-project.org/collation/concepts#TOC-Collator-naming-scheme
-- 
Peter Geoghegan

From 989bc2f877aa01af4ac140a43a5cebcffe8b3ec9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 21 Sep 2017 17:34:13 -0700
Subject: [PATCH] Consistently canonicalize ICU collations' collcollate as BCP
 47.

Previously, on versions of ICU prior to ICU 54 collcollate was stored in
the legacy locale format at initdb time.  We now always use the BCP 47
format there.  To make that work, collcollate is now converted from BCP
47 format to the old locale tag format on-the-fly as an ICU collator is
initially opened within a backend (though only on versions of ICU before
ICU 54).  This establishes a general convention: ICU collations'
collcollate should always be in BCP 47 format.

While at it, have CREATE COLLATION perform some simple locale name
sanitization.  This works by assuming that an inability to get a display
name for a given locale name indicates that the locale name is invalid.
The old locale format is now considered to be an implementation detail.
Only BCP 47 was ever documented as supported, so no doc changes are
required here.

In passing, add DEBUG1 message showing the ICU display name of a
collation when a CREATE COLLATION command is executed, and explicitly
name the root locale "root" when populating pg_collation during initdb.
---
 doc/src/sgml/charset.sgml                      |  13 +-
 src/backend/commands/collationcmds.c           | 110 ++++++---------
 src/backend/utils/adt/pg_locale.c              | 180 ++++++++++++++++++++++++-
 src/include/utils/pg_locale.h                  |   3 +
 src/test/regress/expected/collate.icu.utf8.out |  25 +++-
 src/test/regress/sql/collate.icu.utf8.sql      |  22 ++-
 6 files changed, 257 insertions(+), 96 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 44e4350..85552a6 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -677,7 +677,7 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
      </varlistentry>
 
      <varlistentry>
-      <term><literal>und-x-icu</literal> (for <quote>undefined</quote>)</term>
+      <term><literal>root-x-icu</literal></term>
       <listitem>
        <para>
         ICU <quote>root</quote> collation.  Use this to get a reasonable
@@ -711,7 +711,7 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
      </varlistentry>
 
      <varlistentry>
-      <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji')</literal></term>
+      <term><literal>CREATE COLLATION "root-u-co-emoji-x-icu" (provider = icu, locale = 'root-u-co-emoji')</literal></term>
       <listitem>
        <para>
         Root collation with Emoji collation type, per Unicode Technical Standard #51
@@ -723,7 +723,8 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
       <term><literal>CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit')</literal></term>
       <listitem>
        <para>
-        Sort digits after Latin letters.  (The default is digits before letters.)
+        Sort digits after Latin letters.  (The default is digits
+        before letters.  Requires ICU version 54 or higher.)
        </para>
       </listitem>
      </varlistentry>
@@ -733,7 +734,7 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
       <listitem>
        <para>
         Sort upper-case letters before lower-case letters.  (The default is
-        lower-case letters first.)
+        lower-case letters first.  Requires ICU version 54 or higher.)
        </para>
       </listitem>
      </varlistentry>
@@ -742,7 +743,7 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
       <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-latn-digit')</literal></term>
       <listitem>
        <para>
-        Combines both of the above options.
+        Combines both of the above options.  (Requires ICU version 54 or higher.)
        </para>
       </listitem>
      </varlistentry>
@@ -753,7 +754,7 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
        <para>
         Numeric ordering, sorts sequences of digits by their numeric value,
         for example: <literal>A-21</literal> &lt; <literal>A-123</literal>
-        (also known as natural sort).
+        (also known as natural sort).  (Requires ICU version 54 or higher.)
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 9437731..d64ce01 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -189,7 +189,30 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 	if (!fromEl)
 	{
 		if (collprovider == COLLPROVIDER_ICU)
+		{
+#ifdef USE_ICU
+			char *dname;
+
+			/* Create ICU display name to sanitize user locale string */
+			dname = icu_get_dname_from_collcollate(collcollate);
+
+			/* If no display name created, assume an invalid collcollate */
+			if (!dname)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_NAME),
+						 errmsg("invalid ICU locale string \"%s\"", collcollate)));
+			else
+				ereport(DEBUG1,
+						(errmsg("collation display name: \"%s\"", dname)));
+#endif
+			/* Do not permit truncation of ICU collation's collcollate */
+			if (strlen(collcollate) >= NAMEDATALEN)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("ICU locale string too long")));
+
 			collencoding = -1;
+		}
 		else
 		{
 			collencoding = GetDatabaseEncoding();
@@ -433,67 +456,6 @@ cmpaliases(const void *a, const void *b)
 #endif							/* READ_LOCALE_A_OUTPUT */
 
 
-#ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), TRUE, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
-/*
- * Get a comment (specifically, the display name) for an ICU locale.
- * The result is a palloc'd string, or NULL if we can't get a comment
- * or find that it's not all ASCII.  (We can *not* accept non-ASCII
- * comments, because the contents of template0 must be encoding-agnostic.)
- */
-static char *
-get_icu_locale_comment(const char *localename)
-{
-	UErrorCode	status;
-	UChar		displayname[128];
-	int32		len_uchar;
-	int32		i;
-	char	   *result;
-
-	status = U_ZERO_ERROR;
-	len_uchar = uloc_getDisplayName(localename, "en",
-									displayname, lengthof(displayname),
-									&status);
-	if (U_FAILURE(status))
-		return NULL;			/* no good reason to raise an error */
-
-	/* Check for non-ASCII comment (can't use is_all_ascii for this) */
-	for (i = 0; i < len_uchar; i++)
-	{
-		if (displayname[i] > 127)
-			return NULL;
-	}
-
-	/* OK, transcribe */
-	result = palloc(len_uchar + 1);
-	for (i = 0; i < len_uchar; i++)
-		result[i] = displayname[i];
-	result[len_uchar] = '\0';
-
-	return result;
-}
-#endif							/* USE_ICU */
-
-
 /*
  * pg_import_system_collations: add known system collations to pg_collation
  */
@@ -687,28 +649,32 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 		 */
 		for (i = -1; i < uloc_countAvailable(); i++)
 		{
-			const char *name;
-			char	   *langtag;
+			const char *localename;
 			char	   *icucomment;
-			const char *collcollate;
+			char	   *collcollate;
 			Oid			collid;
 
 			if (i == -1)
-				name = "";		/* ICU root locale */
+				localename = "root";	/* ICU root locale */
 			else
-				name = uloc_getAvailable(i);
+				localename = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
-			collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
+			/* collcollate is canonicalized in BCP 47 language tag format */
+			collcollate = icu_from_localename(localename);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
 			 * pg_collation
 			 */
-			if (!is_all_ascii(langtag) || !is_all_ascii(collcollate))
+			if (!is_all_ascii(collcollate))
 				continue;
 
-			collid = CollationCreate(psprintf("%s-x-icu", langtag),
+			/*
+			 * Collation name is collcollate with private BCP 47 subtag *-x-icu
+			 * appended, to ensure that there will never be a name conflict
+			 * with another collation provider's collation.
+			 */
+			collid = CollationCreate(psprintf("%s-x-icu", collcollate),
 									 nspid, GetUserId(),
 									 COLLPROVIDER_ICU, -1,
 									 collcollate, collcollate,
@@ -720,8 +686,8 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 
 				CommandCounterIncrement();
 
-				icucomment = get_icu_locale_comment(name);
-				if (icucomment)
+				icucomment = icu_get_dname_from_collcollate(collcollate);
+				if (icucomment && is_all_ascii(icucomment))
 					CreateComments(collid, CollationRelationId, 0,
 								   icucomment);
 			}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 5ad75ef..58e0454 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1367,18 +1367,34 @@ pg_newlocale_from_collation(Oid collid)
 #ifdef USE_ICU
 			UCollator  *collator;
 			UErrorCode	status;
+			const char *iculocalename;
+			char	   *localenameold = NULL;
 
 			if (strcmp(collcollate, collctype) != 0)
 				ereport(ERROR,
 						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 						 errmsg("collations with different collate and ctype values are not supported by ICU")));
 
+			/*
+			 * Versions of ICU prior to ICU 54 do not support locale names in
+			 * BCP 47 format within ucol_open().  Convert as needed.
+			 */
+#if U_ICU_VERSION_MAJOR_NUM >= 54
+			iculocalename = collcollate;
+#else
+			localenameold = icu_to_localename(collcollate);
+			iculocalename = localenameold;
+#endif
+
 			status = U_ZERO_ERROR;
-			collator = ucol_open(collcollate, &status);
+			collator = ucol_open(iculocalename, &status);
 			if (U_FAILURE(status))
 				ereport(ERROR,
-						(errmsg("could not open collator for locale \"%s\": %s",
-								collcollate, u_errorName(status))));
+						(errmsg("could not open collator for collcollate \"%s\" (locale \"%s\"): %s",
+								collcollate, iculocalename, u_errorName(status))));
+
+			if (localenameold)
+				pfree(localenameold);
 
 			/* We will leak this string if we get an error below :-( */
 			result.info.icu.locale = MemoryContextStrdup(TopMemoryContext,
@@ -1459,14 +1475,32 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UCollator  *collator;
 		UErrorCode	status;
 		UVersionInfo versioninfo;
+		const char *iculocalename;
+		char	   *localenameold = NULL;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
+		/*
+		 * For ICU, we expect collcollate to be in BCP 47 format, per general
+		 * convention.  Versions of ICU prior to ICU 54 do not support locale
+		 * names in BCP 47 format within ucol_open(), though.  Convert as
+		 * needed.
+		 */
+#if U_ICU_VERSION_MAJOR_NUM >= 54
+		iculocalename = collcollate;
+#else
+
+		localenameold = icu_to_localename(collcollate);
+		iculocalename = localenameold;
+#endif
+
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = ucol_open(iculocalename, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
+					(errmsg("could not open collator for collcollate \"%s\" (locale \"%s\"): %s",
+							collcollate, iculocalename, u_errorName(status))));
+		if (localenameold)
+			pfree(localenameold);
 		ucol_getVersion(collator, versioninfo);
 		ucol_close(collator);
 
@@ -1588,6 +1622,140 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
 	return len_result;
 }
 
+/*
+ * Get an ICU locale name for the specified ICU collcollate.  Since ICU
+ * collcollate is in BCP 47 format on every ICU version, per general
+ * convention, this is simply a wrapper on the ICU facility for converting to
+ * old style locale names from BCP 47 format.
+ *
+ * We export this just to be consistent.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+icu_to_localename(const char *collcollate)
+{
+	UErrorCode	status;
+	char		buf[ULOC_FULLNAME_CAPACITY];
+
+	/*
+	 * We never actually need to use this on versions of ICU that are
+	 * documented as supporting the BCP 47 format natively (See ucol_open()
+	 * documentation).  While it would be harmless to do unneeded translations,
+	 * it still seems like a good idea to avoid working with the old locale
+	 * format where it is deprecated.  (This does mean that the old format is
+	 * actually accepted by CREATE COLLATION with later ICU versions, but that
+	 * doesn't seem worth fixing.)
+	 */
+	Assert(U_ICU_VERSION_MAJOR_NUM < 54);
+
+	/*
+	 * There may be some loss of information in very rare cases where
+	 * "grandfathered [BCP 47] tags have no modern [ISO 639-1, 639-2, 639-3]
+	 * replacement".  Most constructed language tags have a mapping available,
+	 * and no constructed language has its own CLDR collation file, so this is
+	 * considered to be acceptable.
+	 */
+	status = U_ZERO_ERROR;
+	uloc_forLanguageTag(collcollate, buf, sizeof(buf), NULL, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not get ICU locale name for collation \"%s\": %s",
+						collcollate, u_errorName(status))));
+
+	return pstrdup(buf);
+}
+
+/*
+ * Get an ICU collcollate for the specified ICU locale name.  Since ICU
+ * collcollate is in BCP 47 format on every ICU version, per general
+ * convention, this is simply a wrapper on the ICU facility for converting old
+ * style locale names to BCP 47 format.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+icu_from_localename(const char *localename)
+{
+	UErrorCode	status;
+	char		buf[ULOC_FULLNAME_CAPACITY];
+
+	status = U_ZERO_ERROR;
+	uloc_toLanguageTag(localename, buf, sizeof(buf), TRUE, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not get BCP 47 language tag for ICU locale name \"%s\": %s",
+						localename, u_errorName(status))));
+
+	return pstrdup(buf);
+}
+
+/*
+ * Get a display name for an ICU collation.
+ *
+ * Note that we expect collcollate to be in BCP 47 format, per general
+ * convention.  As a consequence of the fact that ICU is an upstream consumer
+ * of the CLDR locale data, producing a display name that describes all aspects
+ * of collation behavior (e.g., maps "kf" to "colcasefirst=upper") doesn't
+ * necessarily indicate that that's the behavior you get, at least for
+ * collation attributes (variant collations, like "de@collation=phonebook",
+ * still work on all versions, though).  Some ICU versions (at least ICU 4.2)
+ * require that collation attributes be set using ucol_setAttribute(), which
+ * never happens.
+ *
+ * The result is a palloc'd string, or NULL if we can't get a display name
+ * because collcollate is entirely invalid (this can also happen during initdb,
+ * when an ICU incompatible database encoding is in use).  Display name is
+ * currently always in English.  This should be changed to localize display
+ * name at some point.
+ */
+char *
+icu_get_dname_from_collcollate(const char *collcollate)
+{
+	UErrorCode	status;
+	UChar		displayname[256];
+	int32_t		len_uchar;
+	const char *iculocalename;
+	char	   *localenameold = NULL;
+	char	   *result;
+
+	/* No point in even trying if database encoding unsupported */
+	if (!is_encoding_supported_by_icu(GetDatabaseEncoding()))
+		return NULL;
+
+	/*
+	 * Versions of ICU prior to ICU 54 do not support locale names in BCP 47
+	 * format within ucol_open().  We assume that the same problem exists for
+	 * uloc_getDisplayName().  Convert as needed.
+	 */
+#if U_ICU_VERSION_MAJOR_NUM >= 54
+	iculocalename = collcollate;
+#else
+	localenameold = icu_to_localename(collcollate);
+	iculocalename = localenameold;
+#endif
+	status = U_ZERO_ERROR;
+	len_uchar = uloc_getDisplayName(iculocalename, ULOC_ENGLISH,
+									displayname, lengthof(displayname),
+									&status);
+
+	if (localenameold)
+		pfree(localenameold);
+
+	if (U_FAILURE(status) || len_uchar <= 0)
+	{
+		ereport(DEBUG1,
+				(errmsg("could not get display name for collation \"%s\": %s",
+						collcollate, u_errorName(status))));
+
+		return NULL;
+	}
+
+	/* Return display name in the database encoding */
+	icu_from_uchar(&result, displayname, len_uchar);
+
+	return result;
+}
 #endif							/* USE_ICU */
 
 /*
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index b633511..9a5bc55 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -107,6 +107,9 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_to_localename(const char *collcollate);
+extern char *icu_from_localename(const char *localename);
+extern char *icu_get_dname_from_collcollate(const char *collcollatebool);
 #endif
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index e1fc998..664dd47 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -983,22 +983,35 @@ CREATE SCHEMA test_schema;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
-          quote_literal(current_setting('lc_collate')) || ');';
+          'en' || ');';
 END
 $$;
 CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
 ERROR:  collation "test0" already exists
+--  Fail, ICU locale string exceeds default NAMEDATALEN (and any reasonable
+--  custom build value):
+do $inline$
+BEGIN
+  EXECUTE $cc$CREATE COLLATION test1 (provider = icu, locale = '$cc$ ||
+          'en-x-' || repeat('har', 100) || $cc$');$cc$;
+END
+$inline$;
+ERROR:  ICU locale string too long
+CONTEXT:  SQL statement "CREATE COLLATION test1 (provider = icu, locale = 'en-x-harharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharharhar');"
+PL/pgSQL function inline_code_block line 3 at EXECUTE
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test1 (provider = icu, lc_collate = ' ||
-          quote_literal(current_setting('lc_collate')) ||
+          'en' ||
           ', lc_ctype = ' ||
-          quote_literal(current_setting('lc_ctype')) || ');';
+          'en' || ');';
 END
 $$;
-CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, need lc_ctype
+CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, need lc_ctype (would fail anyway)
 ERROR:  parameter "lc_ctype" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+CREATE COLLATION alien (provider = icu, locale = 'alien'); /* never fails, could be unknown language */  DROP COLLATION alien;
+CREATE COLLATION latin (provider = icu, locale = 'und-latn'); /* works even on ICU 4.2, sensible use of und placeholder */
+DROP COLLATION latin;
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1123,4 +1136,4 @@ drop cascades to function mylt2(text,text)
 drop cascades to function dup(anyelement)
 RESET search_path;
 -- leave a collation for pg_upgrade test
-CREATE COLLATION coll_icu_upgrade FROM "und-x-icu";
+CREATE COLLATION coll_icu_upgrade FROM "root-x-icu";
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index ef39445..adb238c 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -343,20 +343,30 @@ CREATE SCHEMA test_schema;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
-          quote_literal(current_setting('lc_collate')) || ');';
+          'en' || ');';
 END
 $$;
 CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
+--  Fail, ICU locale string exceeds default NAMEDATALEN (and any reasonable
+--  custom build value):
+do $inline$
+BEGIN
+  EXECUTE $cc$CREATE COLLATION test1 (provider = icu, locale = '$cc$ ||
+          'en-x-' || repeat('har', 100) || $cc$');$cc$;
+END
+$inline$;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test1 (provider = icu, lc_collate = ' ||
-          quote_literal(current_setting('lc_collate')) ||
+          'en' ||
           ', lc_ctype = ' ||
-          quote_literal(current_setting('lc_ctype')) || ');';
+          'en' || ');';
 END
 $$;
-CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, need lc_ctype
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, need lc_ctype (would fail anyway)
+CREATE COLLATION alien (provider = icu, locale = 'alien'); /* never fails, could be unknown language */  DROP COLLATION alien;
+CREATE COLLATION latin (provider = icu, locale = 'und-latn'); /* works even on ICU 4.2, sensible use of und placeholder */
+DROP COLLATION latin;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
@@ -430,4 +440,4 @@ DROP SCHEMA collate_tests CASCADE;
 RESET search_path;
 
 -- leave a collation for pg_upgrade test
-CREATE COLLATION coll_icu_upgrade FROM "und-x-icu";
+CREATE COLLATION coll_icu_upgrade FROM "root-x-icu";
-- 
2.7.4

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Patch to address concerns about ICU collcollate stability in v10 (Was: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?)

Reply via email to