On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote:
> Jeff Davis <pg...@j-davis.com> writes:
> > What should we do with locales like C.UTF-8 in both libc and ICU? 
> 
> I vote for passing those to the existing C-specific code paths,

Great, this would be a big step toward solving the ICU usability issues
in this thread:

https://postgr.es/m/000b01d97465%24c34bbd60%2449e33820%24%40pcorp.us

> Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
> Case-independent might be good, but we haven't accepted such in
> the past, so I don't feel strongly about it.  (Arguably, passing
> lower case "c" to the provider would provide an "out" to anybody
> who dislikes our choices here.)

Patch attached with your suggestions. It's based on the first patch in
the series I posted here:

https://postgr.es/m/a4388fa3acabf7794ac39fdb471ad97eebdfbe11.ca...@j-davis.com

We still need to consider backwards compatibility. If someone has a
collation with locale name C.UTF-8 in an earlier version, any change to
the interpretation of that locale name after an upgrade carries a
corruption risk. The risks are different in ICU vs libc:

  For ICU: iculocale=C in an earlier version was a mistake that must
have been explicitly requested by the user. However, if such a mistake
was made, the indexes would have been created using the ICU root
locale, which is very different from the C locale. So reinterpreting
iculocale=C as memcmp() would be likely to result in index corruption.
Patch 0002 (also based on a patch from the series linked above) solves
this with a pg_upgrade check for iculocale=C in versions 15 and
earlier. The upgrade check is not likely to affect many users, and
those it does affect have a mis-defined collation and would benefit
from the check.

  For libc: this change may affect any user who happened to have
LANG=C.UTF-8 in their environment at initdb time, which is probably a
lot of users, and some buildfarm members. However, the average risk
seems to be much lower, because we've gone a long time with the
assumption that C.UTF-8 has the same behavior as C, and this only
recently came up. Also, I'm not sure how obscure the cases are even if
there is a difference; perhaps they don't often occur in practice? It's
not clear to me how we mitigate this risk further, though.

Regards,
        Jeff Davis

From c6721272200c14931ad757185a3aaeb615c432ed Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Mon, 24 Apr 2023 15:46:17 -0700
Subject: [PATCH 1/2] Interpret C locales consistently between ICU and libc.

Treat a locale named C, C.anything, POSIX, or POSIX.anything as
equivalent to the C locale; implemented with built-in semantics
(memcmp() for collation and pg_ascii_*() for ctype).

Such locales are not passed to the provider at all, so have identical
behavior regardless of whether it's declared with provider ICU or
libc.

Previously, only C and POSIX locales had this behavior (not
e.g. "C.UTF-8"), and only if the provider was declared as libc. That
caused problems on libc for locales like C.UTF-8, which may have
subtly different behavior in some versions of libc; and it caused
problems on ICU because newer versions don't recognize C locales.

Discussion: https://postgr.es/m/1559006.1685040...@sss.pgh.pa.us
Discussion: https://postgr.es/m/c840107b-4cb9-c8e9-abb7-1d8c5e0d51df%40enterprisedb.com
Discussion: https://postgr.es/m/87v8hoexdv....@news-spur.riddles.org.uk
---
 doc/src/sgml/charset.sgml                     |   3 +-
 src/backend/commands/collationcmds.c          |  42 +++---
 src/backend/commands/dbcommands.c             |  41 +++---
 src/backend/utils/adt/pg_locale.c             | 126 +++++++++++++-----
 src/backend/utils/init/postinit.c             |   4 +-
 src/backend/utils/mb/mbutils.c                |   3 +-
 src/include/utils/pg_locale.h                 |   1 +
 .../regress/expected/collate.icu.utf8.out     |   6 +
 src/test/regress/expected/collate.out         |   5 +
 src/test/regress/sql/collate.icu.utf8.sql     |   4 +
 src/test/regress/sql/collate.sql              |   5 +
 11 files changed, 167 insertions(+), 73 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index ed84465996..8ba3117557 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -136,7 +136,8 @@ initdb --locale=sv_SE
    <para>
     If you want the system to behave as if it had no locale support,
     use the special locale name <literal>C</literal>, or equivalently
-    <literal>POSIX</literal>.
+    <literal>POSIX</literal>. An encoding may also be appended, for
+    example <literal>C.UTF-8</literal>.
    </para>
 
    <para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 2969a2bb21..a451ae8843 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -264,26 +264,38 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
 
-			/*
-			 * During binary upgrade, preserve the locale string. Otherwise,
-			 * canonicalize to a language tag.
-			 */
-			if (!IsBinaryUpgrade)
+			if (locale_name_is_c(colliculocale))
 			{
-				char	   *langtag = icu_language_tag(colliculocale,
-													   icu_validation_level);
-
-				if (langtag && strcmp(colliculocale, langtag) != 0)
+				if (!collisdeterministic)
+					ereport(ERROR,
+							(errmsg("nondeterministic collations not supported for C or POSIX locale")));
+				if (collicurules != NULL)
+					ereport(ERROR,
+							(errmsg("RULES not supported for C or POSIX locale")));
+			}
+			else
+			{
+				/*
+				 * During binary upgrade, preserve the locale
+				 * string. Otherwise, canonicalize to a language tag.
+				 */
+				if (!IsBinaryUpgrade)
 				{
-					ereport(NOTICE,
-							(errmsg("using standard form \"%s\" for locale \"%s\"",
-									langtag, colliculocale)));
+					char	   *langtag = icu_language_tag(colliculocale,
+														   icu_validation_level);
+
+					if (langtag && strcmp(colliculocale, langtag) != 0)
+					{
+						ereport(NOTICE,
+								(errmsg("using standard form \"%s\" for locale \"%s\"",
+										langtag, colliculocale)));
 
-					colliculocale = langtag;
+						colliculocale = langtag;
+					}
 				}
-			}
 
-			icu_validate_locale(colliculocale);
+				icu_validate_locale(colliculocale);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 99d4080ea9..601a08ef11 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1058,27 +1058,36 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
-		/*
-		 * During binary upgrade, or when the locale came from the template
-		 * database, preserve locale string. Otherwise, canonicalize to a
-		 * language tag.
-		 */
-		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		if (locale_name_is_c(dbiculocale))
 		{
-			char	   *langtag = icu_language_tag(dbiculocale,
-												   icu_validation_level);
-
-			if (langtag && strcmp(dbiculocale, langtag) != 0)
+			if (dbicurules != NULL)
+				ereport(ERROR,
+						(errmsg("ICU_RULES not supported for C or POSIX locale")));
+		}
+		else
+		{
+			/*
+			 * During binary upgrade, or when the locale came from the
+			 * template database, preserve locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
 			{
-				ereport(NOTICE,
-						(errmsg("using standard form \"%s\" for locale \"%s\"",
-								langtag, dbiculocale)));
+				char	   *langtag = icu_language_tag(dbiculocale,
+													   icu_validation_level);
+
+				if (langtag && strcmp(dbiculocale, langtag) != 0)
+				{
+					ereport(NOTICE,
+							(errmsg("using standard form \"%s\" for locale \"%s\"",
+									langtag, dbiculocale)));
 
-				dbiculocale = langtag;
+					dbiculocale = langtag;
+				}
 			}
-		}
 
-		icu_validate_locale(dbiculocale);
+			icu_validate_locale(dbiculocale);
+		}
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 31e3b16ae0..2f2734a405 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1239,15 +1239,19 @@ lookup_collation_cache(Oid collation, bool set_flags)
 			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_collctype);
 			collctype = TextDatumGetCString(datum);
 
-			cache_entry->collate_is_c = ((strcmp(collcollate, "C") == 0) ||
-										 (strcmp(collcollate, "POSIX") == 0));
-			cache_entry->ctype_is_c = ((strcmp(collctype, "C") == 0) ||
-									   (strcmp(collctype, "POSIX") == 0));
+			cache_entry->collate_is_c = locale_name_is_c(collcollate);
+			cache_entry->ctype_is_c = locale_name_is_c(collctype);
 		}
 		else
 		{
-			cache_entry->collate_is_c = false;
-			cache_entry->ctype_is_c = false;
+			Datum		datum;
+			const char *colliculocale;
+
+			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_colliculocale);
+			colliculocale = TextDatumGetCString(datum);
+
+			cache_entry->collate_is_c = locale_name_is_c(colliculocale);
+			cache_entry->ctype_is_c = cache_entry->collate_is_c;
 		}
 
 		cache_entry->flags_valid = true;
@@ -1258,6 +1262,22 @@ lookup_collation_cache(Oid collation, bool set_flags)
 	return cache_entry;
 }
 
+/*
+ * Check if the locale name should be handled like the C locale.
+ *
+ * If so, the locale should be handled with built-in memcmp() and
+ * pg_ascii_*(); otherwise, the locale should be handled by the collation
+ * provider.
+ */
+bool
+locale_name_is_c(const char *locale)
+{
+	if (strcmp(locale, "C") == 0 || strncmp(locale, "C.", 2) == 0 ||
+		strcmp(locale, "POSIX") == 0 || strncmp(locale, "POSIX.", 6) == 0)
+		return true;
+
+	return false;
+}
 
 /*
  * Detect whether collation's LC_COLLATE property is C
@@ -1279,23 +1299,30 @@ lc_collate_is_c(Oid collation)
 	if (collation == DEFAULT_COLLATION_OID)
 	{
 		static int	result = -1;
-		char	   *localeptr;
-
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return false;
+		const char *localeptr;
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_COLLATE, NULL);
-		if (!localeptr)
-			elog(ERROR, "invalid LC_COLLATE setting");
-
-		if (strcmp(localeptr, "C") == 0)
-			result = true;
-		else if (strcmp(localeptr, "POSIX") == 0)
-			result = true;
+
+		if (default_locale.provider == COLLPROVIDER_ICU)
+		{
+#ifdef USE_ICU
+			localeptr = default_locale.info.icu.locale;
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
+		}
 		else
-			result = false;
+		{
+			localeptr = setlocale(LC_COLLATE, NULL);
+			if (!localeptr)
+				elog(ERROR, "invalid LC_COLLATE setting");
+		}
+
+		result = locale_name_is_c(localeptr);
+
 		return (bool) result;
 	}
 
@@ -1332,23 +1359,30 @@ lc_ctype_is_c(Oid collation)
 	if (collation == DEFAULT_COLLATION_OID)
 	{
 		static int	result = -1;
-		char	   *localeptr;
-
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return false;
+		const char *localeptr;
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_CTYPE, NULL);
-		if (!localeptr)
-			elog(ERROR, "invalid LC_CTYPE setting");
-
-		if (strcmp(localeptr, "C") == 0)
-			result = true;
-		else if (strcmp(localeptr, "POSIX") == 0)
-			result = true;
+
+		if (default_locale.provider == COLLPROVIDER_ICU)
+		{
+#ifdef USE_ICU
+			localeptr = default_locale.info.icu.locale;
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
+		}
 		else
-			result = false;
+		{
+			localeptr = setlocale(LC_CTYPE, NULL);
+			if (!localeptr)
+				elog(ERROR, "invalid LC_CTYPE setting");
+		}
+
+		result = locale_name_is_c(localeptr);
+
 		return (bool) result;
 	}
 
@@ -1375,7 +1409,13 @@ make_icu_collator(const char *iculocstr,
 #ifdef USE_ICU
 	UCollator  *collator;
 
-	collator = pg_ucol_open(iculocstr);
+	if (locale_name_is_c(iculocstr))
+	{
+		Assert(icurules == NULL);
+		collator = NULL;
+	}
+	else
+		collator = pg_ucol_open(iculocstr);
 
 	/*
 	 * If rules are specified, we extract the rules of the standard collation,
@@ -1525,6 +1565,9 @@ pg_newlocale_from_collation(Oid collid)
 			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_collctype);
 			collctype = TextDatumGetCString(datum);
 
+			Assert(!locale_name_is_c(collcollate));
+			Assert(!locale_name_is_c(collctype));
+
 			if (strcmp(collcollate, collctype) == 0)
 			{
 				/* Normal case where they're the same */
@@ -1581,6 +1624,8 @@ pg_newlocale_from_collation(Oid collid)
 			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_colliculocale);
 			iculocstr = TextDatumGetCString(datum);
 
+			Assert(!locale_name_is_c(iculocstr));
+
 			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 			if (!isnull)
 				icurules = TextDatumGetCString(datum);
@@ -1650,6 +1695,9 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 {
 	char	   *collversion = NULL;
 
+	if (locale_name_is_c(collcollate))
+		return NULL;
+
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
@@ -1667,10 +1715,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	}
 	else
 #endif
-		if (collprovider == COLLPROVIDER_LIBC &&
-			pg_strcasecmp("C", collcollate) != 0 &&
-			pg_strncasecmp("C.", collcollate, 2) != 0 &&
-			pg_strcasecmp("POSIX", collcollate) != 0)
+		if (collprovider == COLLPROVIDER_LIBC)
 	{
 #if defined(__GLIBC__)
 		/* Use the glibc version because we don't have anything better. */
@@ -2457,6 +2502,13 @@ pg_ucol_open(const char *loc_str)
 	if (loc_str == NULL)
 		elog(ERROR, "opening default collator is not supported");
 
+	/*
+	 * Must never open special values C or POSIX, which are treated specially
+	 * and not passed to the provider.
+	 */
+	if (locale_name_is_c(loc_str))
+		elog(ERROR, "unexpected ICU locale string: %s", loc_str);
+
 	/*
 	 * In ICU versions 54 and earlier, "und" is not a recognized spelling of
 	 * the root locale. If the first component of the locale is "und", replace
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 6856ed99e7..92928133c0 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -419,9 +419,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 						   " which is not recognized by setlocale().", ctype),
 				 errhint("Recreate the database with another locale or install the missing locale.")));
 
-	if (strcmp(ctype, "C") == 0 ||
-		strcmp(ctype, "POSIX") == 0)
-		database_ctype_is_c = true;
+	database_ctype_is_c = locale_name_is_c(ctype);
 
 	if (dbform->datlocprovider == COLLPROVIDER_ICU)
 	{
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 67a1ab2ab2..9a54a952e0 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -39,6 +39,7 @@
 #include "mb/pg_wchar.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/pg_locale.h"
 #include "utils/syscache.h"
 #include "varatt.h"
 
@@ -1239,7 +1240,7 @@ pg_bind_textdomain_codeset(const char *domainname)
 #ifndef WIN32
 	const char *ctype = setlocale(LC_CTYPE, NULL);
 
-	if (pg_strcasecmp(ctype, "C") == 0 || pg_strcasecmp(ctype, "POSIX") == 0)
+	if (locale_name_is_c(ctype))
 #endif
 		if (encoding != PG_SQL_ASCII &&
 			raw_pg_bind_textdomain_codeset(domainname, encoding))
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index e2a7243542..0e26346546 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool database_ctype_is_c;
 extern bool check_locale(int category, const char *locale, char **canonname);
 extern char *pg_perm_setlocale(int category, const char *locale);
 
+extern bool locale_name_is_c(const char *locale);
 extern bool lc_collate_is_c(Oid collation);
 extern bool lc_ctype_is_c(Oid collation);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index c658ee1404..79ce33abbd 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1043,12 +1043,18 @@ ERROR:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
 ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'C', deterministic = false); -- fails
+ERROR:  nondeterministic collations not supported for C or POSIX locale
+CREATE COLLATION testx (provider = icu, locale = 'C', rules = '&V << w <<< W'); -- fails
+ERROR:  RULES not supported for C or POSIX locale
 RESET icu_validation_level;
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 WARNING:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 WARNING:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = 'C.UTF-8'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'POSIX'); DROP COLLATION testx;
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0649564485..e2d0a39732 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -649,6 +649,11 @@ EXPLAIN (COSTS OFF)
    ->  Seq Scan on collate_test10
 (3 rows)
 
+-- test alternate spellings of special locale C
+CREATE COLLATION coll_c_locale ( LOCALE = "C.something" );
+DROP COLLATION coll_c_locale;
+CREATE COLLATION coll_c_locale ( LOCALE = "POSIX.something" );
+DROP COLLATION coll_c_locale;
 -- CREATE/DROP COLLATION
 CREATE COLLATION mycoll1 FROM "C";
 CREATE COLLATION mycoll2 ( LC_COLLATE = "POSIX", LC_CTYPE = "POSIX" );
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 7bd0901281..adc6b7deec 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -379,9 +379,13 @@ CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, nee
 SET icu_validation_level = ERROR;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'C', deterministic = false); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'C', rules = '&V << w <<< W'); -- fails
 RESET icu_validation_level;
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'C.UTF-8'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'POSIX'); DROP COLLATION testx;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index c3d40fc195..10ff532169 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -241,6 +241,11 @@ EXPLAIN (COSTS OFF)
 EXPLAIN (COSTS OFF)
   SELECT * FROM collate_test10 ORDER BY x DESC, y COLLATE "C" ASC NULLS FIRST;
 
+-- test alternate spellings of special locale C
+CREATE COLLATION coll_c_locale ( LOCALE = "C.something" );
+DROP COLLATION coll_c_locale;
+CREATE COLLATION coll_c_locale ( LOCALE = "POSIX.something" );
+DROP COLLATION coll_c_locale;
 
 -- CREATE/DROP COLLATION
 
-- 
2.34.1

From aaa116db0e1409b95bf8eb3612d8af9dc8de8abb Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Wed, 24 May 2023 12:15:33 -0700
Subject: [PATCH 2/2] pg_upgrade: check for ICU locale C in versions 15 and
 earlier.

ICU collations with locale "C" (and equivalently "POSIX") were not
supported before version 16, but could still be created. Users with
such collations would not get expected behavior.

Reject upgrading clusters with such collations, so that we can support
"C" and "POSIX" locales for the ICU provider without risk of
corrupting indexes during pg_upgrade.
---
 src/bin/pg_upgrade/check.c | 100 +++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 64024e3b9e..e7163c4c7e 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -26,6 +26,7 @@ static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_aclitem_data_type_usage(ClusterInfo *cluster);
+static void check_icu_c_before_16(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
@@ -164,6 +165,9 @@ check_and_dump_old_cluster(bool live_check)
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 905)
 		check_for_pg_role_prefix(&old_cluster);
 
+	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
+		check_icu_c_before_16(&old_cluster);
+
 	if (GET_MAJOR_VERSION(old_cluster.major_version) == 904 &&
 		old_cluster.controldata.cat_ver < JSONB_FORMAT_CHANGE_CAT_VER)
 		check_for_jsonb_9_4_usage(&old_cluster);
@@ -1233,6 +1237,102 @@ check_for_aclitem_data_type_usage(ClusterInfo *cluster)
 		check_ok();
 }
 
+/*
+ * check_icu_c_before_16
+ *
+ *  Version 16 adds support for the ICU C locale, but it was possible to
+ *  (incorrectly) create it in prior versions. Check for this invalid ICU
+ *  locale name, or variant spellings like "C.UTF-8", in version 15 and
+ *  earlier.
+ */
+static void
+check_icu_c_before_16(ClusterInfo *cluster)
+{
+	PGresult   *dat_res;
+	PGconn	   *template1_conn = connectToServer(cluster, "template1");
+	int			dat_ntups;
+	int			i_datname;
+	FILE	   *script = NULL;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for ICU collations with locale \"C\"");
+
+	snprintf(output_path, sizeof(output_path), "%s/%s",
+			 log_opts.basedir,
+			 "icu_c_before_16.txt");
+
+	/* check pg_database */
+	dat_res = executeQueryOrDie(template1_conn,
+								"SELECT datname "
+								"FROM pg_catalog.pg_database "
+								"WHERE datlocprovider='i' "
+								"AND daticulocale ~ '^C(\\..*)?$' "
+								"OR daticulocale ~ '^POSIX(\\..*)?$'");
+
+	i_datname = PQfnumber(dat_res, "datname");
+
+	dat_ntups = PQntuples(dat_res);
+
+	for (int rowno = 0; rowno < dat_ntups; rowno++)
+	{
+		if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+			pg_fatal("could not open file \"%s\": %s",
+					 output_path, strerror(errno));
+		fprintf(script, "default collation for database %s\n",
+				PQgetvalue(dat_res, rowno, i_datname));
+	}
+
+	PQclear(dat_res);
+	PQfinish(template1_conn);
+
+	/* check pg_collation in each database */
+	for (int dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+		PGconn	   *db_conn = connectToServer(cluster, active_db->db_name);
+		PGresult   *coll_res;
+		int			coll_ntups;
+		int			i_collid;
+		int			i_collname;
+
+		coll_res = executeQueryOrDie(db_conn,
+									 "SELECT oid as collid, collname "
+									 "FROM pg_catalog.pg_collation "
+									 "WHERE collprovider='i' "
+									 "AND colliculocale ~ '^C(\\..*)?$' "
+									 "OR colliculocale ~ '^POSIX(\\..*)?$'");
+
+		i_collid = PQfnumber(coll_res, "collid");
+		i_collname = PQfnumber(coll_res, "collname");
+
+		coll_ntups = PQntuples(coll_res);
+
+		for (int rowno = 0; rowno < coll_ntups; rowno++)
+		{
+			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+				pg_fatal("could not open file \"%s\": %s",
+						 output_path, strerror(errno));
+			fprintf(script, "database %s collation %s (oid=%s)\n",
+					active_db->db_name,
+					PQgetvalue(coll_res, rowno, i_collname),
+					PQgetvalue(coll_res, rowno, i_collid));
+		}
+
+		PQclear(coll_res);
+		PQfinish(db_conn);
+	}
+
+	if (script)
+	{
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains ICU collations with the locale \"C\" or \"POSIX\",\n"
+				 "which are not supported until version 16. Earlier versions using ICU \n"
+				 "collations with the \"C\" or \"POSIX\" locales cannot be upgraded.");
+	}
+
+	check_ok();
+}
+
 /*
  * check_for_jsonb_9_4_usage()
  *
-- 
2.34.1

Reply via email to