On Fri, 2023-03-24 at 10:10 +0100, Peter Eisentraut wrote:
> Couldn't we do this in a simpler way by just freeing the collator
> before the ereport() calls?

I committed a tiny patch to do this.

We still need to address the error inconsistency though. The problem is
that, in older ICU versions, if the fixup for "und@colNumeric=lower" ->
"root@colNumeric=lower" is applied, then icu_set_collation_attributes()
will throw an error reporting "root@colNumeric=lower", which is not
what the user typed.

We could fix that directly by passing the original string to
icu_set_collation_attributes() instead, or perhaps as an extra
parameter used only for the ereport().
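
To make the mismatch concrete, here is a minimal standalone sketch. It is not code from the patch: fixup_locale() and format_open_error() are made-up names, and the real fixup goes through ICU rather than plain string matching. It only shows that the collator is opened with the fixed-up string while the message names the string the user typed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the fixup: rewrite a leading "und" language
 * to "root". Returns buf when a fixup was applied, otherwise the input.
 */
static const char *
fixup_locale(const char *loc_str, char *buf, size_t buflen)
{
    assert(loc_str != NULL && buf != NULL);

    if (strncmp(loc_str, "und", 3) == 0 &&
        (loc_str[3] == '\0' || loc_str[3] == '@' ||
         loc_str[3] == '_' || loc_str[3] == '-'))
    {
        snprintf(buf, buflen, "root%s", loc_str + 3);
        return buf;
    }
    return loc_str;
}

/*
 * Build the error text from the original string, not the fixed-up one,
 * so the report matches what the user typed.
 */
static void
format_open_error(const char *orig_str, const char *errname,
                  char *msg, size_t msglen)
{
    snprintf(msg, msglen,
             "could not open collator for locale \"%s\": %s",
             orig_str, errname);
}
```

With "und@colNumeric=lower" as input, the fixed-up string is "root@colNumeric=lower", but the message built by format_open_error() still quotes the "und" form.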

I like the minor refactoring I did better, though. It puts the
ereport() calls close to each other, so any differences are more obvious.
And it seems cleaner to me for pg_ucol_open to close the UCollator
because it's the one that opened it. I don't have a strong opinion, but
that's my reasoning.
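
The shape of that refactoring, reduced to a self-contained sketch. MockCollator and the mock_* functions are invented for illustration and stand in for UCollator, ucol_open(), ucol_close(), and icu_set_collation_attributes(). The attribute setter only fills in a status; the opener checks it and closes the handle it opened.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct MockCollator
{
    int numeric;
} MockCollator;

/* counts open handles so a leak would be visible */
static int live_collators = 0;

static MockCollator *
mock_open(void)
{
    MockCollator *c = calloc(1, sizeof(*c));

    if (c != NULL)
        live_collators++;
    return c;
}

static void
mock_close(MockCollator *c)
{
    assert(c != NULL);
    free(c);
    live_collators--;
}

/* Sets *status nonzero on an unrecognized value; never reports or frees. */
static void
mock_set_attributes(MockCollator *c, const char *value, int *status)
{
    *status = 0;
    if (strcmp(value, "yes") == 0)
        c->numeric = 1;
    else if (strcmp(value, "no") == 0)
        c->numeric = 0;
    else
        *status = 1;
}

/* The opener owns cleanup: on failure it closes the handle, returns NULL. */
static MockCollator *
mock_ucol_open(const char *value)
{
    MockCollator *c = mock_open();
    int status;

    if (c == NULL)
        return NULL;

    mock_set_attributes(c, value, &status);
    if (status != 0)
    {
        mock_close(c);
        return NULL;
    }
    return c;
}
```

In the real code the failure path ends in ereport(ERROR) instead of returning NULL, but the ownership is the same.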

>   Or wrap a PG_TRY/PG_FINALLY around the whole thing?

I generally avoid PG_TRY/FINALLY unless it avoids some major
awkwardness or other problem.

> It would be nicer to not make the callers of 
> icu_set_collation_attributes() responsible for catching and reporting
> the errors.

There's only one caller now: pg_ucol_open().

> [PATCH v8 2/4] initdb: emit message when using default ICU locale.
> 
> I'm not able to make initdb print this message.  Under what 
> circumstances am I supposed to see this?  Do you have some examples?

It happens when you don't specify --icu-locale. It is slightly
redundant with the "ICU locale:" line, but it lets you see that the
locale came from the environment rather than the command line:

-------------
$ initdb -D data   
The files belonging to this database system will be owned by user
"someone".
This user must also own the server process.

Using default ICU locale "en_US_POSIX".
The database cluster will be initialized with this locale
configuration:
  provider:    icu
  ICU locale:  en_US_POSIX
...
-------------

That seems fairly useful for testing, etc., where initdb.log doesn't
show the command line options.

> The function check_icu_locale() has now gotten a lot more
> functionality than its name suggests.  Maybe the part that assigns
> the default ICU locale should be moved up one level to setlocales(),
> which has a better name and does something similar for the libc
> locale realm.

Agreed, done.

In fact, initdb.c:check_icu_locale() is completely unnecessary in that
patch because, as the comment points out, the backend will try to open
the locale during post-bootstrap initialization. I think it was simply
a mistake to try to do this validation in commit 27b62377b4.

The later validation patch does do some better validation at initdb
time to make sure the language can be found.
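
Roughly, the language check works like the standalone sketch below. It does not call ICU: mock_get_language() is a simplified stand-in for uloc_getLanguage(), and the available[] list is an assumed sample in place of what uloc_getAvailable() would report.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for uloc_getLanguage(): copy the leading subtag. */
static void
mock_get_language(const char *loc, char *lang, size_t cap)
{
    size_t i = 0;

    assert(loc != NULL && lang != NULL && cap > 0);
    while (loc[i] != '\0' && loc[i] != '_' && loc[i] != '-' &&
           loc[i] != '@' && i + 1 < cap)
    {
        lang[i] = (char) tolower((unsigned char) loc[i]);
        i++;
    }
    lang[i] = '\0';
}

/* Assumed sample of available locales; ICU supplies the real list. */
static const char *const available[] = {"en_US", "fr_CA", "de"};

static bool
language_is_known(const char *loc)
{
    char lang[32];

    mock_get_language(loc, lang, sizeof(lang));

    /* special names are accepted without a lookup */
    if (lang[0] == '\0' ||
        strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
        strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
        return true;

    /* otherwise the language must match some available locale */
    for (size_t i = 0; i < sizeof(available) / sizeof(available[0]); i++)
    {
        char other[32];

        mock_get_language(available[i], other, sizeof(other));
        if (strcmp(lang, other) == 0)
            return true;
    }
    return false;
}
```

Note that a string like "@colNumeric=lower" has an empty language, so it passes this check and is only rejected later when the collator is opened.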

> [PATCH v8 3/4] Canonicalize ICU locale names to language tags.
> 
> I'm still on the fence about whether we actually want to do this, but
> I'm warming up to it, now that the issues with pre-54 versions are
> fixed.
> 
> But if we do this, the documentation needs to be updated.  There is a
> bunch of text there that says, like, you can do this format or that 
> format, whatever you like.  At least the guidance should be changed
> there.
> 
> 
> [PATCH v8 4/4] Validate ICU locales.
> 
> I would make icu_locale_validation true by default.

Agreed. I considered also not having a GUC, but it seems like some kind
of escape hatch is wise, at least for now.

> Or maybe it should be a log-level type option, so you can set it to 
> error, warning, and also completely off?

As the validation patch seems closer to acceptance, I changed it to be
before the canonicalization patch. New series attached.
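
On the log-level idea, the setting in the attached validation patch behaves roughly like the sketch below: a "disabled" entry maps to a negative value that skips the checks, and any other entry is the level at which problems are reported. The numeric values and the names parse_level() and validation_aborts() are placeholders for illustration; the real levels come from elog.h and the lookup is done by the GUC machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder levels; only their ordering matters for this sketch. */
enum
{
    LVL_DISABLED = -1,
    LVL_NOTICE = 1,
    LVL_WARNING = 2,
    LVL_ERROR = 3
};

struct level_option
{
    const char *name;
    int         val;
};

static const struct level_option level_options[] = {
    {"disabled", LVL_DISABLED},
    {"notice", LVL_NOTICE},
    {"warning", LVL_WARNING},
    {"error", LVL_ERROR},
    {NULL, 0}
};

/* Look up a setting by name; returns false if the name is unknown. */
static bool
parse_level(const char *name, int *val)
{
    assert(name != NULL && val != NULL);

    for (const struct level_option *o = level_options; o->name != NULL; o++)
    {
        if (strcmp(o->name, name) == 0)
        {
            *val = o->val;
            return true;
        }
    }
    return false;
}

/* True when an invalid locale should abort the command. */
static bool
validation_aborts(int level)
{
    if (level < 0)
        return false;           /* disabled: no validation at all */
    return level >= LVL_ERROR;
}
```

So "warning" reports the problem and carries on, "error" stops the command, and "disabled" is the escape hatch.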


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 4fdcac30accb54725f05895cb5240af0bb98cb64 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Thu, 23 Mar 2023 21:50:47 -0700
Subject: [PATCH v9 1/5] Fix error inconsistency in older ICU versions.

To support older ICU versions, we rely on
icu_set_collation_attributes() to do error checking that is handled
directly by ucol_open() in newer ICU versions. Commit 3b50275b12
introduced a slight inconsistency, where the error report includes the
fixed-up locale string, rather than the locale string passed to
pg_ucol_open().

Refactor slightly so that pg_ucol_open() handles the errors from both
ucol_open() and icu_set_collation_attributes(), making it easier to
see any differences between the error reports. It also makes
pg_ucol_open() responsible for closing the UCollator on error, which
seems like the right place.

Discussion: https://postgr.es/m/04182066-7655-344a-b8b7-040b1b2490fb%40enterprisedb.com
Reviewed-by: Peter Eisentraut
---
 src/backend/utils/adt/pg_locale.c | 60 +++++++++++++++++--------------
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 386768ee76..3db27b34ba 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -147,7 +147,8 @@ static size_t uchar_length(UConverter *converter,
 static int32_t uchar_convert(UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, int32_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(UCollator *collator, const char *loc,
+										 UErrorCode *status);
 #endif
 
 /*
@@ -2503,6 +2504,7 @@ pg_ucol_open(const char *loc_str)
 {
 	UCollator  *collator;
 	UErrorCode	status;
+	const char *orig_str = loc_str;
 	char	   *fixed_str = NULL;
 
 	/*
@@ -2551,11 +2553,27 @@ pg_ucol_open(const char *loc_str)
 	collator = ucol_open(loc_str, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
+				/* use original string for error report */
 				(errmsg("could not open collator for locale \"%s\": %s",
-						loc_str, u_errorName(status))));
+						orig_str, u_errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, loc_str);
+	{
+		status = U_ZERO_ERROR;
+		icu_set_collation_attributes(collator, loc_str, &status);
+
+		/*
+		 * Pretend the error came from ucol_open(), for consistent error
+		 * message across ICU versions.
+		 */
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							orig_str, u_errorName(status))));
+		}
+	}
 
 	if (fixed_str != NULL)
 		pfree(fixed_str);
@@ -2705,9 +2723,9 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(UCollator *collator, const char *loc,
+							 UErrorCode *status)
 {
-	UErrorCode	status;
 	int32_t		len;
 	char	   *icu_locale_id;
 	char	   *lower_str;
@@ -2720,15 +2738,15 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	 * locale ID, e.g. "und@colcaselevel=yes;colstrength=primary", by
 	 * uloc_canonicalize().
 	 */
-	status = U_ZERO_ERROR;
-	len = uloc_canonicalize(loc, NULL, 0, &status);
+	*status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, status);
 	icu_locale_id = palloc(len + 1);
-	status = U_ZERO_ERROR;
-	len = uloc_canonicalize(loc, icu_locale_id, len + 1, &status);
-	if (U_FAILURE(status))
+	*status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, status);
+	if (U_FAILURE(*status))
 		ereport(ERROR,
 				(errmsg("canonicalization failed for locale string \"%s\": %s",
-						loc, u_errorName(status))));
+						loc, u_errorName(*status))));
 
 	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
 
@@ -2750,7 +2768,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
 
-			status = U_ZERO_ERROR;
+			*status = U_ZERO_ERROR;
 
 			*e = '\0';
 			name = token;
@@ -2800,22 +2818,12 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			else if (strcmp(value, "upper") == 0)
 				uvalue = UCOL_UPPER_FIRST;
 			else
-				status = U_ILLEGAL_ARGUMENT_ERROR;
-
-			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
-
-			/*
-			 * Pretend the error came from ucol_open(), for consistent error
-			 * message across ICU versions.
-			 */
-			if (U_FAILURE(status))
 			{
-				ucol_close(collator);
-				ereport(ERROR,
-						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+				*status = U_ILLEGAL_ARGUMENT_ERROR;
+				break;
 			}
+
+			ucol_setAttribute(collator, uattr, uvalue, status);
 		}
 	}
 
-- 
2.34.1

From a9a899eb97231625e6b36d5d8d45fab4720852bc Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 24 Mar 2023 16:09:40 -0700
Subject: [PATCH v9 2/5] initdb: replace check_icu_locale() with
 default_icu_locale().

The extra checks done in check_icu_locale() are not necessary. An
existing comment already pointed out that the checks would be done
during post-bootstrap initialization, when the locale is opened by the
backend. This was a mistake in commit 27b62377b4.

This commit creates a simpler function default_icu_locale() to just
return the locale of the default collator.

Discussion: https://postgr.es/m/04182066-7655-344a-b8b7-040b1b2490fb%40enterprisedb.com
---
 src/bin/initdb/initdb.c | 61 +++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 30 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index bae97539fc..6c1641e77e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2242,49 +2242,47 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+
 /*
- * Check that ICU accepts the locale name; or if not specified, retrieve the
- * default ICU locale.
+ * Determine default ICU locale by opening the default collator and reading
+ * its locale.
+ *
+ * NB: The default collator (opened using NULL) is different from the collator
+ * for the root locale (opened with "", "und", or "root"). The former depends
+ * on the environment (useful at initdb time) and the latter does not.
  */
-static void
-check_icu_locale(void)
+static char *
+default_icu_locale(void)
 {
-#ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	const char	*valid_locale;
+	char		*default_locale;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = ucol_open(NULL, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for default locale: %s",
+				 u_errorName(status));
+
+	status = U_ZERO_ERROR;
+	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+										&status);
 	if (U_FAILURE(status))
 	{
-		if (icu_locale)
-			pg_fatal("could not open collator for locale \"%s\": %s",
-					 icu_locale, u_errorName(status));
-		else
-			pg_fatal("could not open collator for default locale: %s",
-					 u_errorName(status));
+		ucol_close(collator);
+		pg_fatal("could not determine default ICU locale");
 	}
 
-	/* if not specified, get locale from default collator */
-	if (icu_locale == NULL)
-	{
-		const char	*default_locale;
+	default_locale = pg_strdup(valid_locale);
 
-		status = U_ZERO_ERROR;
-		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-											  &status);
-		if (U_FAILURE(status))
-		{
-			ucol_close(collator);
-			pg_fatal("could not determine default ICU locale");
-		}
+	ucol_close(collator);
 
-		icu_locale = pg_strdup(default_locale);
-	}
+	return default_locale;
+}
 
-	ucol_close(collator);
 #endif
-}
 
 /*
  * set up the locale variables
@@ -2339,13 +2337,16 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
-		check_icu_locale();
+#ifdef USE_ICU
+		/* acquire default locale from the environment, if not specified */
+		if (icu_locale == NULL)
+			icu_locale = default_icu_locale();
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
 		 * backend during post-bootstrap initialization.
 		 */
-#ifndef USE_ICU
+#else
 		pg_fatal("ICU is not supported in this build");
 #endif
 	}
-- 
2.34.1

From c5a2c2ce0f1dc3dc8d83f3d8d257c077b0fa4f96 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 24 Mar 2023 16:09:44 -0700
Subject: [PATCH v9 3/5] initdb: emit message when using default ICU locale.

Helpful to determine from test logs whether the locale came from the
environment or a command-line option.

Discussion: https://postgr.es/m/04182066-7655-344a-b8b7-040b1b2490fb%40enterprisedb.com
---
 src/bin/initdb/initdb.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 6c1641e77e..508c522fb7 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2340,7 +2340,10 @@ setlocales(void)
 #ifdef USE_ICU
 		/* acquire default locale from the environment, if not specified */
 		if (icu_locale == NULL)
+		{
 			icu_locale = default_icu_locale();
+			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
+		}
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
-- 
2.34.1

From b27a1a51ea381a5b37cc3b47a93c4d4bb850b727 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Fri, 17 Mar 2023 09:55:31 -0700
Subject: [PATCH v9 4/5] Validate ICU locales.

Ensure that it can be transformed into a language tag in "strict" mode
(which validates the attributes), that the language exists in ICU, and
that it can be opened.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important to avoid such mistakes in ICU
versions 54 and earlier, where the same (misspelled) locale string
could fall back to different locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.ca...@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.ca...@j-davis.com
---
 doc/src/sgml/config.sgml                      | 26 +++++++
 src/backend/commands/collationcmds.c          | 10 +++
 src/backend/commands/dbcommands.c             | 10 ++-
 src/backend/utils/adt/pg_locale.c             | 70 +++++++++++++++----
 src/backend/utils/misc/guc_tables.c           | 26 +++++++
 src/backend/utils/misc/postgresql.conf.sample |  3 +
 src/bin/initdb/initdb.c                       | 58 ++++++++++++++-
 src/bin/initdb/t/001_initdb.pl                | 18 +++++
 src/include/utils/pg_locale.h                 |  3 +-
 .../regress/expected/collate.icu.utf8.out     |  7 +-
 src/test/regress/sql/collate.icu.utf8.sql     |  5 +-
 11 files changed, 218 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 481f93cea1..56042a0da8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9784,6 +9784,32 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-validation-level" xreflabel="icu_validation_level">
+      <term><varname>icu_validation_level</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>icu_validation_level</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When ICU locale validation problems are encountered, controls which
+        <link linkend="runtime-config-severity-levels">message level</link> is
+        used to report the problem. Valid values are
+        <literal>DISABLED</literal>, <literal>DEBUG5</literal>,
+        <literal>DEBUG4</literal>, <literal>DEBUG3</literal>,
+        <literal>DEBUG2</literal>, <literal>DEBUG1</literal>,
+        <literal>INFO</literal>, <literal>NOTICE</literal>,
+        <literal>WARNING</literal>, <literal>ERROR</literal>, and
+        <literal>LOG</literal>.
+       </para>
+       <para>
+        If set to <literal>DISABLED</literal>, does not report validation
+        problems at all. Otherwise reports problems at the given message
+        level. The default is <literal>ERROR</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 3d0aea0568..2b734807fb 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -254,10 +254,20 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			if (!IsBinaryUpgrade)
+				icu_validate_locale(colliculocale);
+
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..9cac2df70e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1058,7 +1059,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
-		check_icu_locale(dbiculocale);
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+			icu_validate_locale(dbiculocale);
+
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 3db27b34ba..251476ac89 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -95,6 +95,8 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+int			icu_validation_level = ERROR;
+
 /*
  * lc_time localization cache.
  *
@@ -2830,26 +2832,70 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 	pfree(lower_str);
 }
 
-#endif							/* USE_ICU */
-
 /*
- * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
+ * Perform best-effort check that the locale is a valid one.
  */
 void
-check_icu_locale(const char *icu_locale)
+icu_validate_locale(const char *loc_str)
 {
-#ifdef USE_ICU
-	UCollator  *collator;
+	UCollator	*collator;
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	bool		 found	 = false;
+
+	/* no validation */
+	if (icu_validation_level < 0)
+		return;
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(icu_validation_level,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return;
+	}
+
+	/* check for special language name */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		found = true;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; !found && i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(icu_validation_level,
+					(errmsg("could not get language from locale \"%s\": %s",
+							otherloc, u_errorName(status))));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			found = true;
+	}
 
-	collator = pg_ucol_open(icu_locale);
+	if (!found)
+		ereport(icu_validation_level,
+				(errmsg("locale \"%s\" has unknown language \"%s\"",
+						loc_str, lang)));
+
+	/* check that it can be opened */
+	collator = pg_ucol_open(loc_str);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..e9dd333f73 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -164,6 +164,22 @@ static const struct config_enum_entry intervalstyle_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry icu_validation_level_options[] = {
+	{"disabled", -1, false},
+	{"debug5", DEBUG5, false},
+	{"debug4", DEBUG4, false},
+	{"debug3", DEBUG3, false},
+	{"debug2", DEBUG2, false},
+	{"debug1", DEBUG1, false},
+	{"debug", DEBUG2, true},
+	{"log", LOG, false},
+	{"info", INFO, true},
+	{"notice", NOTICE, false},
+	{"warning", WARNING, false},
+	{"error", ERROR, false},
+	{NULL, 0, false}
+};
+
 StaticAssertDecl(lengthof(intervalstyle_options) == (INTSTYLE_ISO_8601 + 2),
 				 "array length mismatch");
 
@@ -4630,6 +4646,16 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"icu_validation_level", PGC_USERSET, CLIENT_CONN_LOCALE,
+		 gettext_noop("Log level for reporting invalid ICU locale strings."),
+		 NULL
+		},
+		&icu_validation_level,
+		ERROR, icu_validation_level_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"log_error_verbosity", PGC_SUSET, LOGGING_WHAT,
 			gettext_noop("Sets the verbosity of logged messages."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..d661b93f8c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,9 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_validation_level = ERROR		# report ICU locale validation
+					# errors at the given level
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 508c522fb7..4db9c310c8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2244,6 +2244,58 @@ check_icu_locale_encoding(int user_enc)
 
 #ifdef USE_ICU
 
+/*
+ * Perform best-effort check that the locale is a valid one. Should be
+ * consistent with pg_locale.c, except that it doesn't need to open the
+ * collator (that will happen during post-bootstrap initialization).
+ */
+static void
+icu_validate_locale(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	bool		 found	 = false;
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+		return;
+	}
+
+	/* check for special language name */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		found = true;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; !found && i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			pg_fatal("could not get language from locale \"%s\": %s",
+					 otherloc, u_errorName(status));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			found = true;
+	}
+
+	if (!found)
+		pg_fatal("locale \"%s\" has unknown language \"%s\"",
+				 loc_str, lang);
+}
+
 /*
  * Determine default ICU locale by opening the default collator and reading
  * its locale.
@@ -2345,9 +2397,11 @@ setlocales(void)
 			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 		}
 
+		icu_validate_locale(icu_locale);
+
 		/*
-		 * In supported builds, the ICU locale ID will be checked by the
-		 * backend during post-bootstrap initialization.
+		 * In supported builds, the ICU locale ID will be opened during
+		 * post-bootstrap initialization, which will perform extra checks.
 		 */
 #else
 		pg_fatal("ICU is not supported in this build");
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b97420f7e8..db7995fe28 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -128,6 +128,24 @@ if ($ENV{with_icu} eq 'yes')
 		],
 		qr/error: encoding mismatch/,
 		'fails for encoding not supported by ICU');
+
+	command_fails_like(
+		[
+			'initdb',                '--no-sync',
+			'--locale-provider=icu',
+			'--icu-locale=nonsense-nowhere', "$tempdir/dataX"
+		],
+		qr/error: locale "nonsense-nowhere" has unknown language "nonsense"/,
+		'fails for nonsense language');
+
+	command_fails_like(
+		[
+			'initdb',                '--no-sync',
+			'--locale-provider=icu',
+			'--icu-locale=@colNumeric=lower', "$tempdir/dataX"
+		],
+		qr/could not open collator for locale "\@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR/,
+		'fails for invalid collation argument');
 }
 else
 {
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index dd822a68be..d2252dc95e 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT int	 icu_validation_level;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -121,8 +122,8 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern void icu_validate_locale(const char *loc_str);
 #endif
-extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
 extern size_t wchar2char(char *to, const wchar_t *from, size_t tolen,
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..4158b3c15a 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1035,7 +1035,12 @@ END
 $$;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+ERROR:  locale "nonsense-nowhere" has unknown language "nonsense"
+SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+WARNING:  locale "nonsense-nowhere" has unknown language "nonsense"
+RESET icu_validation_level;
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..95d96f2eb8 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -371,7 +371,10 @@ BEGIN
 END
 $$;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+RESET icu_validation_level;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
-- 
2.34.1

From d7959ebbf28e9d4054765c86525005f0d7707078 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v9 5/5] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.ca...@j-davis.com
---
 src/backend/commands/collationcmds.c          | 51 ++++++------
 src/backend/commands/dbcommands.c             | 25 ++++++
 src/backend/utils/adt/pg_locale.c             | 79 +++++++++++++++++++
 src/bin/initdb/initdb.c                       | 77 ++++++++++++++++++
 src/bin/initdb/t/001_initdb.pl                |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 31 +++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 14 ++++
 9 files changed, 257 insertions(+), 27 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 2b734807fb..f34d888208 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -260,9 +265,31 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
 
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
 			if (!IsBinaryUpgrade)
-				icu_validate_locale(colliculocale);
+			{
+				char *langtag = icu_language_tag(colliculocale,
+												 icu_validation_level);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
 
+				icu_validate_locale(colliculocale);
+			}
 #else
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -586,26 +613,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -967,7 +974,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, ERROR);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9cac2df70e..e91b14f722 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1059,8 +1059,33 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
 		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			char *langtag = icu_language_tag(dbiculocale,
+											 icu_validation_level);
+
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+
 			icu_validate_locale(dbiculocale);
+		}
 
 #else
 		ereport(ERROR,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 251476ac89..a337b7a9c2 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2832,6 +2832,85 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 	pfree(lower_str);
 }
 
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, int elevel)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_toLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if ((status == U_BUFFER_OVERFLOW_ERROR ||
+			 (U_SUCCESS(status) && len >= buflen)) &&
+			buflen < MaxAllocSize)
+		{
+			buflen = Min(buflen * 2, MaxAllocSize);
+			langtag = repalloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not convert locale name \"%s\" to language tag: %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	return langtag;
+}
+
 /*
  * Perform best-effort check that the locale is a valid one.
  */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4db9c310c8..cd19420c3b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2244,6 +2244,74 @@ check_icu_locale_encoding(int user_enc)
 
 #ifdef USE_ICU
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag() in pg_locale.c.
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_toLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if (status == U_BUFFER_OVERFLOW_ERROR ||
+			(U_SUCCESS(status) && len >= buflen))
+		{
+			buflen = buflen * 2;
+			langtag = pg_realloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+}
+
 /*
  * Perform best-effort check that the locale is a valid one. Should be
  * consistent with pg_locale.c, except that it doesn't need to open the
@@ -2390,6 +2458,8 @@ setlocales(void)
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
 #ifdef USE_ICU
+		char *langtag;
+
 		/* acquire default locale from the environment, if not specified */
 		if (icu_locale == NULL)
 		{
@@ -2397,6 +2467,13 @@ setlocales(void)
 			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 		}
 
+		/* canonicalize to a language tag */
+		langtag = icu_language_tag(icu_locale);
+		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+			   langtag, icu_locale);
+		pg_free(icu_locale);
+		icu_locale = langtag;
+
 		icu_validate_locale(icu_locale);
 
 		/*
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index db7995fe28..17a444d80c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -144,7 +144,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu',
 			'--icu-locale=@colNumeric=lower', "$tempdir/dataX"
 		],
-		qr/could not open collator for locale "\@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR/,
+		qr/could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR/,
 		'fails for invalid collation argument');
 }
 else
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index d2252dc95e..78f7d5d0d9 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -122,6 +122,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *loc_str, int elevel);
 extern void icu_validate_locale(const char *loc_str);
 #endif
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 4158b3c15a..bb75aa1af6 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1019,6 +1019,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1033,12 +1034,17 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
 ERROR:  locale "nonsense-nowhere" has unknown language "nonsense"
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 SET icu_validation_level = WARNING;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
 WARNING:  locale "nonsense-nowhere" has unknown language "nonsense"
 RESET icu_validation_level;
 CREATE COLLATION test4 FROM nonsense;
@@ -1167,14 +1173,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1182,7 +1192,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1190,13 +1202,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1204,10 +1219,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
 (1 row)
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR
+NOTICE:  using language tag "und-u-kn-lower" for locale "@colNumeric=lower"
+ERROR:  could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1216,6 +1233,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1243,10 +1261,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,7 +1317,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1311,6 +1334,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1318,6 +1342,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1778,7 +1803,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 95d96f2eb8..ed8a4b90ff 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -357,6 +357,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -370,8 +372,12 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
 SET icu_validation_level = WARNING;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 RESET icu_validation_level;
@@ -457,10 +463,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -470,7 +480,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -659,7 +671,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1
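
For readers skimming the patch, the grow-and-retry loop used by icu_language_tag() can be looked at in isolation. The following is only a rough sketch without ICU: fake_convert() and convert_with_retry() are hypothetical names invented for this illustration, and fake_convert() merely imitates the behavior described in the patch comments (it reports the full length, and leaves the result unterminated when it fits the buffer exactly). It uses plain malloc/realloc rather than palloc or pg_malloc.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for an ICU-style conversion call (not an ICU API).
 * Copies at most "cap" bytes of src into dst, NUL-terminates only if there
 * is room, flags overflow when src is longer than cap, and always returns
 * the full length needed.
 */
static int
fake_convert(const char *src, char *dst, size_t cap, int *overflow)
{
	size_t		len = strlen(src);
	size_t		ncopy = (len < cap) ? len : cap;

	memcpy(dst, src, ncopy);
	if (len < cap)
		dst[len] = '\0';
	*overflow = (len > cap);
	return (int) len;
}

/*
 * Call fake_convert() with a small buffer, doubling the buffer until the
 * result fits with room for the terminator. Returns a malloc'd string, or
 * NULL on allocation failure.
 */
static char *
convert_with_retry(const char *src)
{
	size_t		buflen = 4;		/* deliberately tiny starting size */
	char	   *buf = malloc(buflen);

	if (buf == NULL)
		return NULL;

	for (;;)
	{
		int			overflow;
		int			len = fake_convert(src, buf, buflen, &overflow);

		/* retry on overflow, or on an exact fit with no terminator */
		if (overflow || (size_t) len >= buflen)
		{
			char	   *tmp;

			buflen *= 2;
			tmp = realloc(buf, buflen);
			if (tmp == NULL)
			{
				free(buf);
				return NULL;
			}
			buf = tmp;
			continue;
		}
		break;
	}

	return buf;
}
```

Calling convert_with_retry("und-u-kf-upper") should hand back a freshly allocated copy of the string after two rounds of doubling. The len >= buflen test is what covers the exact-fit case, which would otherwise return an unterminated buffer.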
