Re: Order changes in PG16 since ICU introduction

Peter Eisentraut Mon, 15 May 2023 03:06:56 -0700

On 11.05.23 23:29, Jeff Davis wrote:

New patch series attached.


=== 0001: fix bug that allows creating hidden collations

Bug:
https://www.postgresql.org/message-id/[email protected]

This is still being debated in the other thread. Not really related to
this thread, so I suggest dropping it from this patch series.

=== 0002: handle some kinds of libc-style locale strings

ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't
in later versions. Handle them in postgres for consistency.

I tend to agree with ICU that these variants are obsolete, and we don't
need to support them anymore. If this were a tiny patch, then maybe ok,
but the way it's presented here the whole code is duplicated between
pg_locale.c and initdb.c, which is not great.
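
For readers unfamiliar with the naming scheme under discussion: libc locale names conventionally take the form language[_territory][.codeset][@modifier]. The sketch below only illustrates that shape and a naive mapping to a hyphenated language tag. It is not the code from patch 0002, and the function names split_libc_locale and to_language_tag are hypothetical.

```python
def split_libc_locale(name):
    """Split 'language[_territory][.codeset][@modifier]' into its four parts.

    Missing parts come back as empty strings.
    """
    base, _, modifier = name.partition("@")
    base, _, codeset = base.partition(".")
    language, _, territory = base.partition("_")
    return language, territory, codeset, modifier


def to_language_tag(name):
    """Naive mapping to a hyphenated tag: drops the codeset and the modifier.

    Assumption for illustration only; a real conversion would have to decide
    what, if anything, a modifier such as 'euro' should translate to.
    """
    language, territory, _codeset, _modifier = split_libc_locale(name)
    return "-".join(part for part in (language, territory) if part)


print(split_libc_locale("fr_FR@euro"))
print(to_language_tag("fr_FR@euro"))
print(to_language_tag("en_US.UTF-8"))
```

Dropping the modifier is the simplest possible policy, and it is in line with the view that variants like @euro are obsolete.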

=== 0003: reduce icu_validation_level to WARNING

Given that we've seen some inconsistency in which locale names are
accepted in different ICU versions, it seems best not to be too strict.
Peter Eisentraut suggested that it be set to ERROR originally, but a
WARNING should be sufficient to see problems without introducing risks
migrating to version 16.

I'm not sure why this is the conclusion. Presumably, the detection
capabilities of ICU improve over time, so we want to take advantage of
that? What are some example scenarios where this change would help?

=== 0004-0006:

To solve the issues that have come up in this thread, we need CREATE
DATABASE (and createdb and initdb) to use LOCALE to mean the collation
locale regardless of which provider is in use (which is what 0006
does).

0006 depends on ICU handling libc locale names. It already does a good
job for most libc locale names (though patch 0002 fixes a few cases
where it doesn't). There may be more cases, but for the most part libc
names are interpreted in a reasonable way. But one important case is
missing: ICU does not handle the "C" locale as we expect (that is,
using memcmp()).

We've already allowed users to create ICU collations with the C locale
in the past, which uses the root collation (not memcmp()), and we need
to keep supporting that for upgraded clusters.

I'm not sure I agree that we need to keep supporting that. The only way
you could get that in past releases is if you specify explicitly, "give
me provider ICU and locale C", and then it wouldn't actually even work
correctly. So nobody should be using that in practice, and nobody
should have stumbled into that combination of settings by accident.
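
To make the distinction concrete: a memcmp()-based comparison orders strings by their raw bytes, while a linguistic collation such as the ICU root collation does not. The sketch below shows the byte-order side exactly. For the linguistic side it uses a rough case-folding key as a stand-in, because ICU is not available in the Python standard library. That stand-in is an assumption for illustration, not the actual root collation, and both function names are hypothetical.

```python
def memcmp_order(strings):
    """Sort by UTF-8 bytes, which is what a memcmp()-style comparison does."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def linguistic_standin_order(strings):
    """Rough stand-in for a language-aware collation (NOT real ICU).

    Primary key: case-folded text. Tie-breaker: the case-swapped string,
    which puts the lowercase form first among otherwise equal words.
    """
    return sorted(strings, key=lambda s: (s.casefold(), s.swapcase()))


words = ["banana", "Apple", "apple", "Banana"]

print(memcmp_order(words))
# ['Apple', 'Banana', 'apple', 'banana']

print(linguistic_standin_order(words))
# ['apple', 'Apple', 'banana', 'Banana']
```

In byte order every uppercase ASCII letter sorts before every lowercase one, so the two results differ even on plain ASCII input.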

   3. Introduce collation provider "none", which is always memcmp-based
(patch 0004). It's equivalent to the libc locale=C, but it allows
specifying the LC_COLLATE and LC_CTYPE independently. A command like
CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C'
LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none"
(patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For
the database default collation, that would always use memcmp(), but the
server environment LC_COLLATE would be set to en_US as the user
specified.

This seems most attractive, but I think it's quite invasive at this
point, especially given the dubious premise (see above).

=== 0007: Add a GUC to control the default collation provider

Having a GUC would make it easier to migrate to ICU without surprises.
This only affects the default for CREATE COLLATION, not CREATE DATABASE
(and obviously not initdb).

It's not clear to me why we would want that. Also not clear why it
should only affect CREATE COLLATION.
