Re: Order changes in PG16 since ICU introduction
On Fri, 2023-06-16 at 16:50 +0200, Peter Eisentraut wrote: > This looks good to me. > > Attached is small fixup patch with some documentation tweaks and > simplifying some test code (also includes pgperltidy). Thank you. Committed with your fixups. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 14.06.23 23:24, Jeff Davis wrote: On Mon, 2023-06-12 at 23:04 +0200, Peter Eisentraut wrote: Patch 0003: Makes LOCALE apply to all providers. The overall feel after this patch is that "locale" now means the collation locale, and LC_COLLATE/LC_CTYPE are for the server environment. When using libc, LC_COLLATE and LC_CTYPE still work as they did before, but their relationship to database collation feels more like a special case of the libc provider. I believe most people favor this patch and I haven't seen recent objections. This seems reasonable. Attached a clean patch for this. It seems to have widespread agreement so I plan to commit to v16 soon. To clarify, this affects both initdb and CREATE DATABASE. This looks good to me. Attached is small fixup patch with some documentation tweaks and simplifying some test code (also includes pgperltidy). From 0cd2154f364999091aba52136a139df75f58d1b7 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 16 Jun 2023 16:46:36 +0200 Subject: [PATCH] fixup! CREATE DATABASE: make LOCALE apply to all collation providers. --- doc/src/sgml/ref/create_database.sgml | 12 ++-- src/test/icu/t/010_database.pl| 23 +-- 2 files changed, 19 insertions(+), 16 deletions(-) diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml index dab05950ed..b2c8aef1ad 100644 --- a/doc/src/sgml/ref/create_database.sgml +++ b/doc/src/sgml/ref/create_database.sgml @@ -147,9 +147,9 @@ Parameters Sets the default collation order and character classification in the new database. Collation affects the sort order applied to strings, -e.g., in queries with ORDER BY, as well as the order used in indexes +e.g., in queries with ORDER BY, as well as the order used in indexes on text columns. Character classification affects the categorization -of characters, e.g., lower, - upper and digit. Also sets the +of characters, e.g., lower, upper, and digit. Also sets the associated aspects of the operating system environment, LC_COLLATE and LC_CTYPE. The default is the same setting as the template database. See Parameters Sets LC_COLLATE in the database server's operating system environment. The default is the setting of if specified; otherwise the same +linkend="create-database-locale"/> if specified, otherwise the same setting as the template database. See below for additional restrictions. @@ -198,7 +198,7 @@ Parameters Sets LC_CTYPE in the database server's operating system environment. The default is the setting of if specified; otherwise the same +linkend="create-database-locale"/> if specified, otherwise the same setting as the template database. See below for additional restrictions. @@ -218,8 +218,8 @@ Parameters Specifies the ICU locale (see ) for the database default collation order and character classification, overriding the setting -. The must be ICU. The default +. The locale provider must be ICU. The default is the setting of if specified; otherwise the same setting as the template database. diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl index b417b1a409..cbe5467f3c 100644 --- a/src/test/icu/t/010_database.pl +++ b/src/test/icu/t/010_database.pl @@ -52,27 +52,30 @@ # Test that LOCALE='C' works for ICU -my $ret1 = $node1->psql('postgres', - q{CREATE DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8} -); -is($ret1, 0, +is( $node1->psql( + 'postgres', + q{CREATE DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8} + ), + 0, "C locale works for ICU"); # Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE # are specified -my $ret2 = $node1->psql('postgres', - q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' +is( $node1->psql( + 'postgres', + q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' LC_COLLATE='C' LC_CTYPE='C' TEMPLATE template0 ENCODING UTF8} -); -is($ret2, 0, + ), + 0, "LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE are specified"); # Test that ICU-specific LOCALE without LC_COLLATE and LC_CTYPE must # be specified with ICU_LOCALE -my ($ret3, $stdout, $stderr) = $node1->psql('postgres', +my ($ret, $stdout, $stderr) = $node1->psql( + 'postgres', q{CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' TEMPLATE template0 ENCODING UTF8}); -isnt($ret3, 0, +isnt($ret, 0, "ICU-specific locale must be specified with ICU_LOCALE: exit code not 0"); like( $stderr, --
Re: Order changes in PG16 since ICU introduction
On Mon, 2023-06-12 at 23:04 +0200, Peter Eisentraut wrote: > I object to adding a new provider for PG16 (patch 0001). Added to July CF for 17. > > 2. Patch 0004 is possibly out of scope for 16 > Also clearly a new feature. Added to July CF for 17. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 09.06.23 02:36, Jeff Davis wrote: Patches 0001, 0002: These patches implement the built-in provider and automatically change provider=icu to provider=builtin when the locale is C. I object to adding a new provider for PG16 (patch 0001). This is clearly a new feature, which wasn't even contemplated a few weeks ago. * Switch to the libc provider for the C locale: would make the libc provider even more complicated and had some potential for confusion, and also has catalog representation problems when --locale is specified along with --lc-ctype. I don't follow this concern. This could be done entirely within initdb. I mean, just change the default for --locale-provider if --locale=C is given. That's like 10 lines of code in initdb.c. I don't think I want CREATE DATABASE or CREATE COLLATION to have that logic, nor do they really need it. Patch 0003: Makes LOCALE apply to all providers. The overall feel after this patch is that "locale" now means the collation locale, and LC_COLLATE/LC_CTYPE are for the server environment. When using libc, LC_COLLATE and LC_CTYPE still work as they did before, but their relationship to database collation feels more like a special case of the libc provider. I believe most people favor this patch and I haven't seen recent objections. This seems reasonable. 1. If you specify --locale-provider=builtin at initdb time, you *must* specify --locale=C/POSIX, otherwise you get an error. Shouldn't the --locale option be ignored (or rejected) in that case. Why insist on it being specified? 2. Patch 0004 is possibly out of scope for 16, but it felt consistent with the other UI changes and low risk. Please try with/without before objecting. Also clearly a new feature. Also the implications of various upgrade, dump/restore scenarios are not fully explored. I think it's an interesting idea, to make CREATE DATABASE and CREATE COLLATION also default to icu of the respective higher scope has been set to icu. In fact, this makes me wonder now whether changing the default to icu in *only* initdb is sensible. But again, we'd need to see the full matrix of upgrade scenarios laid out here. 3. Daniel Verite felt that we should only change the provider from ICU to "builtin" for the C locale if the provider is defaulting to ICU; not if it's specified as ICU. Correct, we shouldn't override what was explicitly specified. I did not differentiate between specifying ICU and defaulting to ICU because: a. "libc" unconditionally uses the built-in memcmp() logic for C, it never actually uses libc b. If a user really wants the root locale or the en-US-u-va-posix locale, they can specify those directly c. I don't see any plausible case where it helps a user to keep provider=icu when locale=C. If the user specifies that, that's up to them to deal with the outcomes. Just changing it to something different seems wrong. 4. Joe Conway and Peter Eisentraut both felt that C.UTF-8 with provider=icu should not be changed to use the builtin provider, and instead passed on to ICU. I implemented a compromise where initdb will change C.UTF-8 to the built-in provider; but CREATE DATABASE/COLLATION will pass it along to ICU (which may support it as en-US-u-va-posix in some versions, or may throw an error in other versions). My reasoning is that initdb is pulling from the environment, and we should try harder to succeed on any reasonable environmental settings (otherwise initdb with default settings could fail); whereas we can be more strict with CREATE DATABASE/COLLATION. I'm not objecting to changing anything about C.UTF-8, but I am objecting to changing anything substantial in PG16. 5. For the built-in provider, initdb defaults to UTF-8 rather than SQL_ASCII. Otherwise, you would be unable to use ICU at all later, because ICU doesn't support SQL_ASCII. Also a behavior change that is not appropriate for PG16 at this stage.
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > I guess where I'm confused is: why would a user actually want their > database collation to be C.UTF-8? It's slower than C, our > implementation doesn't properly version it (as you pointed out), and > the semantics don't seem great ('Z' < 'a'). Because when LC_CTYPE=C, characters outside of US ASCII are not categorized properly. upper/lower/regexp matching/... produce incorrect results. > But if they don't specify the provider, isn't it much more likely they > just don't care much about the locale, and would be happier with C? Consider a pre-existing script doing initdb --locale=C.UTF-8 Surely it does care about the locale, otherwise it would not specify it. Assuming that it would be better off with C is assuming that a non-Unicode aware locale is better than the Unicode-aware locale they're asking. I don't think it's reasonable. > The user can easily get libc behavior by specifying --locale- > provider=libc, so I don't see how you reached this conclusion. What would be user hostile is forcing users that don't need an ICU locale to change their invocations of initdb/createdb to avoid regressions with v16. Most people would discover this after it breaks their apps. > It looks like you are fine with 0003 applying LOCALE to whatever > provider is chosen, but you'd like to be smarter about choosing the > provider and to choose libc in at least some cases. > > That is actually very much like option #2 in the list I presented > here[2], and has the same problems. How should the following behave? > > initdb --locale=C --lc-collate=fr_FR.utf8 > initdb --locale=en --lc-collate=fr_FR.utf8 The same as v15. > If we switch to libc in the first case, then --locale will be ignored > and the collation will be fr_FR.utf8. $ initdb --locale=C --lc-collate=fr_FR.utf8 v15 does that: The database cluster will be initialized with this locale configuration: provider:libc LC_COLLATE: fr_FR.utf8 LC_CTYPE:C LC_MESSAGES: C LC_MONETARY: C LC_NUMERIC: C LC_TIME: C The default database encoding has accordingly been set to "SQL_ASCII". --locale is not ignored, it's overriden for LC_COLLATE only. > But we will leave the second case as ICU and the collation will be > "en". Yes. To me the rule for "ICU is the default" in v16 should be: if the --locale argument points to a locale that we know ICU does not provide, we fall back to the v15 behavior down to every detail, otherwise we let ICU be the provider. > You also suggested that we consider switching the provider to libc any > time ICU doesn't support something. I'm not sure whether you meant a > static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test. C, C.*, POSIX. I'm not sure if there are other cases. > I'm also not clear whether you think we should abandon the built-in > provider, or still select it for C/POSIX. I see it as going in v17, because it came after feature freeze and is not strictly necessary in v16. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-06-09 at 14:12 +0200, Daniel Verite wrote: > > I implemented a compromise where initdb will > > change C.UTF-8 to the built-in provider > > $ initdb --locale=C.UTF-8 ... > This setup is not what the user has asked for and leads to that kind > of > wrong results: > > $ psql -c "select upper('é')" > ?column? > -- > é > > whereas in v15 we would get the correct result 'É'. I guess where I'm confused is: why would a user actually want their database collation to be C.UTF-8? It's slower than C, our implementation doesn't properly version it (as you pointed out), and the semantics don't seem great ('Z' < 'a'). If the user specifies provider=libc, then of course we should honor that and C.UTF-8 is a valid locale for libc. But if they don't specify the provider, isn't it much more likely they just don't care much about the locale, and would be happier with C? Perhaps there's some better compromise here than the one I picked, but I see this as a fairly small problem in comparison to the big problems that we're solving. > In general about the evolution of the patchset, your interpretation > of "defaulting to ICU" seems to be "avoid libc at any cost", which > IMV > is unreasonably user-hostile. The user can easily get libc behavior by specifying --locale- provider=libc, so I don't see how you reached this conclusion. Let me try to understand and address the points you raised here[1] in more detail: It looks like you are fine with 0003 applying LOCALE to whatever provider is chosen, but you'd like to be smarter about choosing the provider and to choose libc in at least some cases. That is actually very much like option #2 in the list I presented here[2], and has the same problems. How should the following behave? initdb --locale=C --lc-collate=fr_FR.utf8 initdb --locale=en --lc-collate=fr_FR.utf8 If we switch to libc in the first case, then --locale will be ignored and the collation will be fr_FR.utf8. But we will leave the second case as ICU and the collation will be "en". I'm sure we can come up with something there, but it feels like there's more room for confusion along this path, and the builtin provider seems cleaner. You also suggested that we consider switching the provider to libc any time ICU doesn't support something. I'm not sure whether you meant a static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test. I'm skeptical of being too smart here, but I'd like to hear what you mean. I'm also not clear whether you think we should abandon the built-in provider, or still select it for C/POSIX. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/7de2dc15-5211-45b3-afcb-71dcaf7a0...@manitou-mail.org [2] https://www.postgresql.org/message-id/daa9f060aa2349ebc8515efece49e7b32c5d.ca...@j-davis.com
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > I implemented a compromise where initdb will > change C.UTF-8 to the built-in provider This handling of C.UTF-8 would be felt by users as simply broken. With the v10 patches: $ initdb --locale=C.UTF-8 initdb: using locale provider "builtin" for ICU locale "C.UTF-8" The database cluster will be initialized with this locale configuration: default collation provider: builtin default collation locale:C LC_COLLATE: C.UTF-8 LC_CTYPE:C.UTF-8 This setup is not what the user has asked for and leads to that kind of wrong results: $ psql -c "select upper('é')" ?column? -- é whereas in v15 we would get the correct result 'É'. Then once inside that cluster, trying to create a database: postgres=# create database test locale='C.UTF-8'; ERROR: locale provider "builtin" does not support locale "C.UTF-8" HINT: The built-in locale provider only supports the "C" and "POSIX" locales. That hardly makes sense considering that initdb stated the opposite, that the "built-in provider" was adequate for C.UTF-8 In general about the evolution of the patchset, your interpretation of "defaulting to ICU" seems to be "avoid libc at any cost", which IMV is unreasonably user-hostile. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
On 6/8/23 17:15, Jeff Davis wrote: On Wed, 2023-06-07 at 20:52 -0400, Joe Conway wrote: If the provider has no such thing, throw an error. Just to be clear, that implies that users (and buildfarm members) with LANG=C.UTF-8 in their environment would not be able to run a plain "initdb -D data"; they'd get an error. It's hard for me to estimate how many users might be inconvenienced by that, but it sounds like a risk. Well, but only if their libc provider does not have C.UTF-8, correct? I see Linux Mint 21.1:/usr/lib/locale/C.utf8 RHEL 8: /usr/lib/locale/C.utf8 RHEL 9: /usr/lib/locale/C.utf8 AL2:/usr/lib/locale/C.utf8 However I do not see it on RHEL 7 :-( Perhaps for this specific case, and only in initdb, we change C.anything and POSIX.anything to the builtin provider? Might be best, with some kind of warning perhaps? CREATE DATABASE and CREATE COLLATION could still reject such locales. Seems to make sense. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order changes in PG16 since ICU introduction
On Wed, 2023-06-07 at 20:52 -0400, Joe Conway wrote: > If the provider has no such thing, throw an error. Just to be clear, that implies that users (and buildfarm members) with LANG=C.UTF-8 in their environment would not be able to run a plain "initdb -D data"; they'd get an error. It's hard for me to estimate how many users might be inconvenienced by that, but it sounds like a risk. Perhaps for this specific case, and only in initdb, we change C.anything and POSIX.anything to the builtin provider? CREATE DATABASE and CREATE COLLATION could still reject such locales. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > As I replied in that subthread, that creates a worse problem: if you > only change the provider when the locale is C, then what about when the > locale is *not* C? > > export LANG=en_US.UTF-8 > initdb -D data --locale=fr_FR.UTF-8 > ... >provider:icu >ICU locale: en-US What you're proposing with the 0003 patch still applies. In the above case I think we would end up with: provider=icu ICU locale=fr-FR lc_collate=fr_FR.UTF-8 lc_lctype=fr_FR.UTF-8 which is reasonable. In the following cases we would initialize a libc cluster instead of an ICU cluster: - initdb --locale=C - initdb --locale=POSIX - LANG=C initdb - LANG=C.UTF-8 initdb - LANG=POSIX initdb - ... possibly other locales that we find are unsuitable for ICU That is, the rule "ICU by default" really means "ICU unless the locale that we're being passed or getting from the environment has semantics that ICU does not provide but we know libc provides, in which case we fall back to libc". The user who wants ICU imperatively should invoke --icu-locale=something or --locale=something --locale-provider=icu in which case we should not fallback to libc. We still have to determine lc_collate and lc_ctype either from the environment or from the locale argument (I think we should favor the environment), except if the user specifies --lc-collate=... lc-ctype=... Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
Tatsuo Ishii wrote: > >> Yes it's a special case but when doing initdb --locale=C, a user does > >> not need or want an ICU locale. They want the same thing than what v15 > >> does with the same arguments: a template0 database with > >> datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL. > > So in this case the only way to keep the same behavior in v16 with "initdb > --locale=C" (--no-locale) in v15 is, bulding PostgreSQL --without-icu? AFAIK the --no-locale case in v16 is fixed since: commit 5cd1a5af4d17496a58678c8eb7ab792119c2d723 Author: Jeff Davis Date: Fri Apr 21 13:11:18 2023 -0700 Fix initdb --no-locale. Discussion: https://postgr.es/m/878relf7cb@news-spur.riddles.org.uk Reported-by: Andrew Gierth The --locale=C case is still being discussed. To me it should produce the same result than --no-locale and --locale=C in v15, that is, "ICU is the default" does not apply to that case, but currently it initializes the cluster with an ICU locale. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
>> As I replied in that subthread, that creates a worse problem: if you >> only change the provider when the locale is C, then what about when the >> locale is *not* C? >> >> export LANG=en_US.UTF-8 >> initdb -D data --locale=fr_FR.UTF-8 >> ... >> provider:icu >> ICU locale: en-US >> >> I believe that case is an order of magnitude worse than the other cases >> you brought up in that subthread. >> >> It also leaves the fundamental problem in place that LOCALE only >> applies to the libc provider, which multiple people have agreed is not >> acceptable. Note that most of PostgreSQL users in Japan do initdb --no-locale. Almost never use other than C locale because the users do not rely on system collation. Most database have an extra column which represents the pronunciation in Hiragana or Katakana. Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Order changes in PG16 since ICU introduction
On 6/7/23 19:26, Jeff Davis wrote: * What do we do in the case where the environment has LANG=C.UTF-8 (as some buildfarm members do)? Is an error acceptable in that case? If I understand the discussion so far correctly, I think that case should fall to the provider. If it supports "C.UTF-8" explicitly as some distributions do, then use it. If the provider has no such thing, throw an error. Somewhere we should document that "C.UTF-8" from the provider might not be as stable or working as they expect. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order changes in PG16 since ICU introduction
Hi, > On Wed, 2023-06-07 at 23:50 +0200, Daniel Verite wrote: >> The simplest way to obtain that in v16 is to teach initdb that >> --locale=C without the --locale-provider option implies that >> --locale-provider=libc ([1]) > > As I replied in that subthread, that creates a worse problem: if you > only change the provider when the locale is C, then what about when the > locale is *not* C? > > export LANG=en_US.UTF-8 > initdb -D data --locale=fr_FR.UTF-8 > ... > provider:icu > ICU locale: en-US > > I believe that case is an order of magnitude worse than the other cases > you brought up in that subthread. > > It also leaves the fundamental problem in place that LOCALE only > applies to the libc provider, which multiple people have agreed is not > acceptable. Daniels comment: >> Yes it's a special case but when doing initdb --locale=C, a user does >> not need or want an ICU locale. They want the same thing than what v15 >> does with the same arguments: a template0 database with >> datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL. So in this case the only way to keep the same behavior in v16 with "initdb --locale=C" (--no-locale) in v15 is, bulding PostgreSQL --without-icu? Best reagards, -- Tatsuo Ishii SRA OSS LLC English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp
Re: Order changes in PG16 since ICU introduction
On Thu, 2023-06-08 at 00:11 +0200, Peter Eisentraut wrote: > On 05.06.23 19:54, Jeff Davis wrote: > > New patch series attached. > > Could you clarify what here is intended for 16 and what is for later? I apologize about the patch churn here. I implemented several approaches to see what feedback I get, and now it looks like we're returning to a previous idea (the "builtin" provider). In v16: 1. We need LOCALE to apply to all providers. 2. We need LOCALE=C to give the memcmp/pg_ascii behavior in all cases (unless overridden by a separate LC_COLLATE or LC_CTYPE parameter). Those are the biggest problems raised in this thread, and the patches to accomplish those things are in scope for v16. After we sort those out, there are two loose ends: * What do we do in the case where the environment has LANG=C.UTF-8 (as some buildfarm members do)? Is an error acceptable in that case? * Do we move icu_validation_level back to ERROR? Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Wed, 2023-06-07 at 23:50 +0200, Daniel Verite wrote: > The simplest way to obtain that in v16 is to teach initdb that > --locale=C without the --locale-provider option implies that > --locale-provider=libc ([1]) As I replied in that subthread, that creates a worse problem: if you only change the provider when the locale is C, then what about when the locale is *not* C? export LANG=en_US.UTF-8 initdb -D data --locale=fr_FR.UTF-8 ... provider:icu ICU locale: en-US I believe that case is an order of magnitude worse than the other cases you brought up in that subthread. It also leaves the fundamental problem in place that LOCALE only applies to the libc provider, which multiple people have agreed is not acceptable. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 05.06.23 19:54, Jeff Davis wrote: New patch series attached. Could you clarify what here is intended for 16 and what is for later? This patch set keeps expanding and changing in each iteration. There is a PG16 open item linked to this thread * The rules for choosing default ICU locale seem pretty unfriendly which I think would be addressed by an appropriately fixed up variant of your patch 0003. (Or if not, what is the actual issue?) Everything else appears to be either new feature work or fixing pre-existing prehavior, so is not in scope for PG16 and should be dealt with elsewhere, so we can focus here on closing out this release.
Re: Order changes in PG16 since ICU introduction
On 05.06.23 19:54, Jeff Davis wrote: New patch series attached. I plan to commit 0001 and 0002 soon, unless there are objections. 0001 causes the "C" and "POSIX" locales to be treated with memcmp/pg_ascii semantics in ICU, just like in libc. We also considered a new "none" provider, but it's more invasive, and we can always reconsider that in the v17 cycle. 0002 introduces an upgrade check for users who have explicitly requested provider=icu and iculocale=C on older versions, and rejects upgrading from v15 in that case to avoid index corruption. Having such a collation is almost certainly a mistake by the user, because the collator would not give the expected memcmp semantics. I'm dubious about these two. 0003 seems like the correct direction. In createdb.c, the change you add makes sense, but you should also remove the existing use of the locale variable: - if (locale) - { - if (!lc_ctype) - lc_ctype = locale; - if (!lc_collate) - lc_collate = locale; - } -
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > The locale "C" is a special case, documented as a non-locale. So, if > LOCALE/--locale apply to ICU, then either ICU needs to handle locale > "C" in the expected way (v8 patch series); or when we see locale "C" we > need to somehow change the provider into something that can handle it > (v6 patch series changes it to the "none" provider). Yes it's a special case but when doing initdb --locale=C, a user does not need or want an ICU locale. They want the same thing than what v15 does with the same arguments: a template0 database with datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL. The simplest way to obtain that in v16 is to teach initdb that --locale=C without the --locale-provider option implies that --locale-provider=libc ([1]) OTOH what you're proposing with the 0001 patch is much more involved in terms of tweaking the ICU code so that dateiculocale='C' and datlocprovider='i' becomes a combination that provides the C semantics that ICU doesn't have natively. I don't agree with the reasoning that to make progress with the other uses of --locale, we need to start by either making ICU support C/POSIX (0001/0002), or add a new "none/builtin" provider (previous set of patches). v16 should not need it any more than v15 did, if v16 does the same as v15 on locale='C', that is not involve ICU at all. > Then that enables us to change LOCALE/--locale to apply to ICU, which > means that a simple command like "initdb --locale=en_US" does a > sensible thing regardless of the default provider. > > I understand you are skeptical of trying to apply an arbitrary locale > name to ICU, but if they don't specify the provider, what do you expect > to happen? It's a hard question because it depends on what people have in their locale environment combined with what they try to do. I think that initdb without any locale option should work well in the majority of environments, but specifying a locale alone will not work well in a number of cases, so users might end up concluding that they need to specify not only the provider but lc_collate/lc_ctype. [1] https://www.postgresql.org/message-id/360c90b9-7c20-4cec-aade-38e6e3351...@manitou-mail.org Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
On 22.05.23 19:35, Jeff Davis wrote: On Thu, 2023-05-11 at 13:07 +0200, Peter Eisentraut wrote: Here is my proposed patch for this. The commit message makes it sound like lc_collate/ctype are completely obsolete, and I don't think that's quite right: they still represent the server environment, which does still matter in some cases. I'd just say that they are too confusing (likely to be misused), and becoming obsolete (or less relevant), or something along those lines. Otherwise, this is fine with me. I didn't do a detailed review because it's just mechanical. I have committed this with some tuning of the commit message.
Re: Order changes in PG16 since ICU introduction
> "Joe" == Joe Conway writes: > On 6/6/23 15:55, Tom Lane wrote: >> Robert Haas writes: >>> On Tue, Jun 6, 2023 at 3:25 PM Tom Lane wrote: Also +1, except that I find "none" a rather confusing choice of name. There *is* a provider, it's just PG itself not either libc or ICU. I thought Joe's suggestion of "internal" made more sense. >> >>> Or perhaps "builtin" or "postgresql". >> Either OK by me Joe> Same here I like either "internal" or "builtin" because they correctly identify that no external resources are used. I'm not keen on "postgresql". -- Andrew.
Re: Order changes in PG16 since ICU introduction
"Jonathan S. Katz" writes: > Since we're bikeshedding, "postgresql" or "builtin" could make it seem > to a (app) developer that these may be recommended options, as we're > trusting PostgreSQL to make the best choices for us. Granted, v16 is > (theoretically) defaulting to ICU, so that choice is made, but the > unsuspecting developer could make a switch based on that naming. I don't think this is a problem, as long as any locale name other than C/POSIX fails when combined with that provider name. regards, tom lane
Re: Order changes in PG16 since ICU introduction
On 6/6/23 3:56 PM, Joe Conway wrote: On 6/6/23 15:55, Tom Lane wrote: Robert Haas writes: On Tue, Jun 6, 2023 at 3:25 PM Tom Lane wrote: Also +1, except that I find "none" a rather confusing choice of name. There *is* a provider, it's just PG itself not either libc or ICU. I thought Joe's suggestion of "internal" made more sense. Or perhaps "builtin" or "postgresql". Either OK by me Same here Since we're bikeshedding, "postgresql" or "builtin" could make it seem to a (app) developer that these may be recommended options, as we're trusting PostgreSQL to make the best choices for us. Granted, v16 is (theoretically) defaulting to ICU, so that choice is made, but the unsuspecting developer could make a switch based on that naming. However, I don't have a strong alternative -- I understand the concern about "internal", so I'd be OK with "postgresql" unless a better name appears. Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: Order changes in PG16 since ICU introduction
On 6/6/23 15:55, Tom Lane wrote: Robert Haas writes: On Tue, Jun 6, 2023 at 3:25 PM Tom Lane wrote: Also +1, except that I find "none" a rather confusing choice of name. There *is* a provider, it's just PG itself not either libc or ICU. I thought Joe's suggestion of "internal" made more sense. Or perhaps "builtin" or "postgresql". Either OK by me Same here -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order changes in PG16 since ICU introduction
Robert Haas writes: > On Tue, Jun 6, 2023 at 3:25 PM Tom Lane wrote: >> Also +1, except that I find "none" a rather confusing choice of name. >> There *is* a provider, it's just PG itself not either libc or ICU. >> I thought Joe's suggestion of "internal" made more sense. > Or perhaps "builtin" or "postgresql". Either OK by me regards, tom lane
Re: Order changes in PG16 since ICU introduction
On Tue, Jun 6, 2023 at 3:25 PM Tom Lane wrote: > Joe Conway writes: > > On 6/6/23 15:18, Jeff Davis wrote: > >> The locale "C" is a special case, documented as a non-locale. So, if > >> LOCALE/--locale apply to ICU, then either ICU needs to handle locale > >> "C" in the expected way (v8 patch series); or when we see locale "C" we > >> need to somehow change the provider into something that can handle it > >> (v6 patch series changes it to the "none" provider). > > > +1 to the latter approach > > Also +1, except that I find "none" a rather confusing choice of name. > There *is* a provider, it's just PG itself not either libc or ICU. > I thought Joe's suggestion of "internal" made more sense. Or perhaps "builtin" or "postgresql". I'm just thinking that "internal" as a type name kind of means "you shouldn't be touching this from SQL" and we don't want to give people the idea that the "C" locale isn't something you should use. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Order changes in PG16 since ICU introduction
Joe Conway writes: > On 6/6/23 15:18, Jeff Davis wrote: >> The locale "C" is a special case, documented as a non-locale. So, if >> LOCALE/--locale apply to ICU, then either ICU needs to handle locale >> "C" in the expected way (v8 patch series); or when we see locale "C" we >> need to somehow change the provider into something that can handle it >> (v6 patch series changes it to the "none" provider). > +1 to the latter approach Also +1, except that I find "none" a rather confusing choice of name. There *is* a provider, it's just PG itself not either libc or ICU. I thought Joe's suggestion of "internal" made more sense. regards, tom lane
Re: Order changes in PG16 since ICU introduction
On 6/6/23 15:18, Jeff Davis wrote: On Tue, 2023-06-06 at 15:09 +0200, Daniel Verite wrote: FWIW I don't quite see how 0001 improve things or what problem it's trying to solve. The word "locale" is generic, so we need to make LOCALE/--locale apply to whatever provider is being used. If "locale" only applies to libc, using ICU will always be confusing and never be on the same level as libc, let alone the preferred provider. Agree 100% The locale "C" is a special case, documented as a non-locale. So, if LOCALE/--locale apply to ICU, then either ICU needs to handle locale "C" in the expected way (v8 patch series); or when we see locale "C" we need to somehow change the provider into something that can handle it (v6 patch series changes it to the "none" provider). +1 to the latter approach Please let me know if you disagree with the goal or the reasoning here. If so, please explain where you think we should end up, because the status quo does not seem great to me. also +1 0001 creates exceptions throughout the code so that when an ICU collation has a locale name "C" or "POSIX" then it does not behave like an ICU collation, even though pg_collation.collprovider='i' To me it's neither desirable nor necessary that a collation that has collprovider='i' is diverted to non-ICU semantics. It's not very principled, but it matches what libc does. Makes sense to me -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order changes in PG16 since ICU introduction
On 6/6/23 15:15, Jeff Davis wrote: On Tue, 2023-06-06 at 14:11 -0400, Joe Conway wrote: This discussion makes me wonder (though probably too late for the v16 cycle) if we shouldn't treat "C" and "POSIX" locales to be a third provider, something like "internal". That's exactly what I did in v6 of this series: I created a "none" provider, and when someone specified provider=icu iculocale=C, it would change the provider to "none": https://www.postgresql.org/message-id/5f9bf4a0b040428c5db2dc1f23cc3ad96acb5672.camel%40j-davis.com I'm fine with either approach. Ha! Well it seems like I am +1 on that then ;-) -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order changes in PG16 since ICU introduction
On Tue, 2023-06-06 at 14:11 -0400, Joe Conway wrote: > This discussion makes me wonder (though probably too late for the v16 > cycle) if we shouldn't treat "C" and "POSIX" locales to be a third > provider, something like "internal". That's exactly what I did in v6 of this series: I created a "none" provider, and when someone specified provider=icu iculocale=C, it would change the provider to "none": https://www.postgresql.org/message-id/5f9bf4a0b040428c5db2dc1f23cc3ad96acb5672.camel%40j-davis.com I'm fine with either approach. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 6/6/23 09:09, Daniel Verite wrote: Jeff Davis wrote: New patch series attached. I plan to commit 0001 and 0002 soon, unless there are objections. 0001 causes the "C" and "POSIX" locales to be treated with memcmp/pg_ascii semantics in ICU, just like in libc. We also considered a new "none" provider, but it's more invasive, and we can always reconsider that in the v17 cycle. 0001 creates exceptions throughout the code so that when an ICU collation has a locale name "C" or "POSIX" then it does not behave like an ICU collation, even though pg_collation.collprovider='i' To me it's neither desirable nor necessary that a collation that has collprovider='i' is diverted to non-ICU semantics. This discussion makes me wonder (though probably too late for the v16 cycle) if we shouldn't treat "C" and "POSIX" locales to be a third provider, something like "internal". -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > New patch series attached. I plan to commit 0001 and 0002 soon, unless > there are objections. > > 0001 causes the "C" and "POSIX" locales to be treated with > memcmp/pg_ascii semantics in ICU, just like in libc. We also > considered a new "none" provider, but it's more invasive, and we can > always reconsider that in the v17 cycle. FWIW I don't quite see how 0001 improve things or what problem it's trying to solve. 0001 creates exceptions throughout the code so that when an ICU collation has a locale name "C" or "POSIX" then it does not behave like an ICU collation, even though pg_collation.collprovider='i' To me it's neither desirable nor necessary that a collation that has collprovider='i' is diverted to non-ICU semantics. Also in the current state, this diversion does not apply to initdb. "initdb --icu-locale=C" with 0001 applied reports this: Using language tag "en-US-u-va-posix" for ICU locale "C". The database cluster will be initialized with this locale configuration: provider:icu ICU locale: en-US-u-va-posix LC_COLLATE: fr_FR.UTF-8 [...] and "initdb --locale=C" reports this: Using default ICU locale "fr_FR". Using language tag "fr-FR" for ICU locale "fr_FR". The database cluster will be initialized with this locale configuration: provider:icu ICU locale: fr-FR LC_COLLATE: C [...] Could you elaborate a bit more on what 0001 is meant to achieve, from the point of view of the user? Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > > #1 > > > > postgres=# create database test1 locale='fr_FR.UTF-8'; > > NOTICE: using standard form "fr-FR" for ICU locale "fr_FR.UTF-8" > > ERROR: new ICU locale (fr-FR) is incompatible with the ICU locale of > > I don't see a problem here. If you specify LOCALE to CREATE DATABASE, > you should either be using "TEMPLATE template0", or you should be > expecting an error if the LOCALE doesn't match exactly. > > What would you like to see happen here? What's odd is that initdb starting in an fr_FR.UTF-8 environment found that "fr" was the default ICU locale to use, whereas "create database" reports that "fr" and "fr_FR.UTF-8" refer to incompatible locales. To me initdb is wrong when coming up with the less precise "fr" instead of "fr-FR". I suggest the attached patch to call uloc_getDefault() instead of the current code that somehow leaves out the country/region component. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 31156e863b..09a5c98cc0 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -2354,42 +2354,13 @@ icu_validate_locale(const char *loc_str) } /* - * Determine default ICU locale by opening the default collator and reading - * its locale. - * - * NB: The default collator (opened using NULL) is different from the collator - * for the root locale (opened with "", "und", or "root"). The former depends - * on the environment (useful at initdb time) and the latter does not. + * Determine the default ICU locale */ static char * default_icu_locale(void) { #ifdef USE_ICU - UCollator *collator; - UErrorCode status; - const char *valid_locale; - char *default_locale; - - status = U_ZERO_ERROR; - collator = ucol_open(NULL, &status); - if (U_FAILURE(status)) - pg_fatal("could not open collator for default locale: %s", - u_errorName(status)); - - status = U_ZERO_ERROR; - valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, - &status); - if (U_FAILURE(status)) - { - ucol_close(collator); - pg_fatal("could not determine default ICU locale"); - } - - default_locale = pg_strdup(valid_locale); - - ucol_close(collator); - - return default_locale; + return pg_strdup(uloc_getDefault()); #else pg_fatal("ICU is not supported in this build"); #endif
Re: Order changes in PG16 since ICU introduction
On Mon, 2023-05-22 at 14:34 +0200, Peter Eisentraut wrote: > Please put blank lines between > > > > > etc., matching existing style. > > We usually don't capitalize the collation parameters like > > CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP); > > elsewhere in the documentation. > > Table 24.2. ICU Collation Settings should probably be sorted by key, > or > at least by something. > > All tables should referenced in the text, like "Table x.y shows this > and > that." (Note that a table could float to a different page in some > output formats, so just putting it into a section without some > introductory text isn't sound.) Thank you, done. > Table 24.1. ICU Collation Levels shows punctuation as level 4, which > is > only true in shifted mode, which isn't the default. The whole > business > of treating variable collation elements is getting a bit lost in this > description. The kv option is described as "Classes of characters > ignored during comparison at level 3.", which is effectively true but > not the whole picture. I organized the documentation around practical examples and available options, and less around the conceptual model. I think that's a good start, but you're right that it over-simplifies in a few areas. Discussing the model would work better along with an explanation of ICU rules, where you can make better use of those concepts. I feel like there are some interesting things that can be done with rules, but I haven't had a chance to really dig in yet. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 5/24/23 11:39, Jeff Davis wrote: On Mon, 2023-05-22 at 22:09 +0200, Daniel Verite wrote: In practice we're probably getting the "und" ICU locale whereas "fr" would be appropriate. This is a good point and illustrates that ICU is not a drop-in replacement for libc in all cases. I don't see a solution here that doesn't involve some rough edges, though. "Locale" is a generic term, and if we continue to insist that it really means a libc locale, then ICU will never be on an equal footing with libc, let alone the preferred provider. Huge +1 IMHO the experience should be unified to the degree possible. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order changes in PG16 since ICU introduction
On Mon, 2023-05-22 at 22:09 +0200, Daniel Verite wrote: > While I agree that the LOCALE option in CREATE DATABASE is > counter-intuitive, I think it's more than that. As Andreww Gierth pointed out: $ initdb --locale=fr_FR ... ICU locale: en-US ... Is more than just counter-intuitive. I don't think we can ship 16 that way. > I find it questionable that blending ICU > and libc locales into it helps that much with the user experience. Thank you for going through some examples here. I agree that it's not perfect, but we need some path to a reasonable ICU user experience, and I think we'll have to accept some rough edges to avoid the worst cases, like above. > initdb: > > Using default ICU locale "fr". > Using language tag "fr" for ICU locale "fr". > ... > #1 > > postgres=# create database test1 locale='fr_FR.UTF-8'; > NOTICE: using standard form "fr-FR" for ICU locale "fr_FR.UTF-8" > ERROR: new ICU locale (fr-FR) is incompatible with the ICU locale of I don't see a problem here. If you specify LOCALE to CREATE DATABASE, you should either be using "TEMPLATE template0", or you should be expecting an error if the LOCALE doesn't match exactly. What would you like to see happen here? > #2 > > postgres=# create database test2 locale='C.UTF-8' > template='template0'; > NOTICE: using standard form "en-US-u-va-posix" for ICU locale > "C.UTF-8" > CREATE DATABASE > > > en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so > this interpretation is arguably not what a user would expect. As you pointed out, this is not settled in libc either: https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f%40manitou-mail.org We really can't expect a particular order for a particular locale name, unless we handle it specially like "C" or "POSIX". If we pass it to the provider, we have to trust the provider to match our conceptual expectations for that locale (and ideally version it properly). > I would expect the ICU warning or error (icu_validation_level) to > kick > in instead of that transliteration. Earlier versions of ICU (<= 63) do this transformation automatically, and I don't see a reason to throw an error if ICU considers it valid. The language tag en-US-u-va-posix will be stored in the catalog, and that will be considered valid in later versions of ICU. Later versions of ICU (>= 64) consider locales with a language name of "C" to be obsolete and no longer valid. I added code to do the transformation without error in these later versions, but I think we have agreement to remove it. If a user specifies the locale as "C.UTF-8", we can either pass it to ICU and see whether that version accepts it or not (and if not, throw a warning/error); or if we decide that "C.UTF-8" really means "C", we can handle it in the memcmp() path like C and never send it to ICU. > #3 > > $ grep french /etc/locale.alias > french fr_FR.ISO-8859-1 > > postgres=# create database test3 locale='french' template='template0' > encoding='LATIN1'; > WARNING: ICU locale "french" has unknown language "french" > HINT: To disable ICU locale validation, set parameter > icu_validation_level > to DISABLED. > CREATE DATABASE > > > In practice we're probably getting the "und" ICU locale whereas "fr" > would > be appropriate. This is a good point and illustrates that ICU is not a drop-in replacement for libc in all cases. I don't see a solution here that doesn't involve some rough edges, though. "Locale" is a generic term, and if we continue to insist that it really means a libc locale, then ICU will never be on an equal footing with libc, let alone the preferred provider. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > If we special case locale=C, but do nothing for locale=fr_FR, then I'm > not sure we've solved the problem. Andrew Gierth raised the issue here, > which he called "maximally confusing": > > https://postgr.es/m/874jp9f5jo@news-spur.riddles.org.uk > > That's why I feel that we need to make locale apply to whatever the > provider is, not just when it happens to be C. While I agree that the LOCALE option in CREATE DATABASE is counter-intuitive, I find it questionable that blending ICU and libc locales into it helps that much with the user experience. Trying the lastest v6-* patches applied on top of 722541ead1 (before the pgindent run), here are a few examples when I don't think it goes well. The OS is Ubuntu 22.04 (glibc 2.35, ICU 70.1) initdb: Using default ICU locale "fr". Using language tag "fr" for ICU locale "fr". The database cluster will be initialized with this locale configuration: provider:icu ICU locale: fr LC_COLLATE: fr_FR.UTF-8 LC_CTYPE:fr_FR.UTF-8 LC_MESSAGES: fr_FR.UTF-8 LC_MONETARY: fr_FR.UTF-8 LC_NUMERIC: fr_FR.UTF-8 LC_TIME: fr_FR.UTF-8 The default database encoding has accordingly been set to "UTF8". #1 postgres=# create database test1 locale='fr_FR.UTF-8'; NOTICE: using standard form "fr-FR" for ICU locale "fr_FR.UTF-8" ERROR: new ICU locale (fr-FR) is incompatible with the ICU locale of the template database (fr) HINT: Use the same ICU locale as in the template database, or use template0 as template. That looks like a fairly generic case that doesn't work seamlessly. #2 postgres=# create database test2 locale='C.UTF-8' template='template0'; NOTICE: using standard form "en-US-u-va-posix" for ICU locale "C.UTF-8" CREATE DATABASE en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so this interpretation is arguably not what a user would expect. I would expect the ICU warning or error (icu_validation_level) to kick in instead of that transliteration. #3 $ grep french /etc/locale.alias french fr_FR.ISO-8859-1 postgres=# create database test3 locale='french' template='template0' encoding='LATIN1'; WARNING: ICU locale "french" has unknown language "french" HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. CREATE DATABASE In practice we're probably getting the "und" ICU locale whereas "fr" would be appropriate. I assume that we would find more cases like that if testing on many operating systems. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
On Mon, 2023-05-22 at 14:27 +0200, Peter Eisentraut wrote: > The rules are for setting whatever sort order you like. Maybe you > want > to sort + before - or whatever. It's like, if you don't like it, > build > your own. A build-your-own feature is fine, but it's not completely zero cost. There some risk that rules specified for ICU version X fail to load for ICU version Y. If that happens to your database default collation, you are in big trouble. The risk of failing to load a language tag in a later version, especially one returned by uloc_toLanguageTag() in strict mode, is much lower. We can reduce the risk by allowing rules only for CREATE COLLATION (not CREATE DATABASE), and see what users do with it first, and consider adding it to CREATE DATABASE later. We can also try to explain in the docs that it's a build-it-yourself kind of feature (use it if you see a purpose, otherwise ignore it), though I'm not sure quite how we should word it. And I'm skeptical that we don't have a single plausible end-to-end user story. I just can't think of any reason someone would need something like this, given how flexible the collation settings in the language tags are. The best case I can think of is if someone is trying to make an ICU collation that matches some non-ICU collation in another system, which sounds hard; but perhaps it's reasonable to do in cases where it just needs to work well-enough in some limited case. Also, do we have an answer as to why specifying the rules as '' is not the same as not specifying any rules[1]? [1] https://www.postgresql.org/message-id/36a6e89689716c2ca1fae8adc8e84601a041121c.ca...@j-davis.com > The co settings are just everything else. > They are not parametric, they are just some other sort order that > someone spelled out explicitly. This sounds like another case where we can't really tell the user why they would want to use a specific "co" setting; they should only use it if they already know they want it. Is there some way we can word that in the documentation so that people don't misuse them? For instance, one of them is called "emoji". I'm sure a lot of applications use emoji (or at least might encounter them), should they always use co-emoji, or would some people who are using emoji not want it? Can it be combined with "ks" or other "k*" settings? What I'm trying to avoid is users seeing something in the documentation and using it without it really being a good fit for their problem. Then they see something unexpected, and need to rebuild all of their indexes or something. > > * I don't understand what "kc" means if "ks" is not set to > > "level1". > > There is an example here: > https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings#colcaselevel Interesting, thank you. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Thu, 2023-05-11 at 13:07 +0200, Peter Eisentraut wrote: > Here is my proposed patch for this. The commit message makes it sound like lc_collate/ctype are completely obsolete, and I don't think that's quite right: they still represent the server environment, which does still matter in some cases. I'd just say that they are too confusing (likely to be misused), and becoming obsolete (or less relevant), or something along those lines. Otherwise, this is fine with me. I didn't do a detailed review because it's just mechanical. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Thu, 2023-05-11 at 13:09 +0200, Peter Eisentraut wrote: > There is also the deterministic flag and the icurules setting. > Depending on what level of detail you imagine the user needs, you > really > do need to look at the whole picture, not some subset of it. (Nit: all database default collations are deterministic.) I agree, but I think there should be a way to see the whole picture in one command. If nothing else, for repro cases sent to the list, it would be nice to have a single line like: SELECT show_default_collation_whole_picture(); Right now it involves some back and forth, like checking datlocprovider, then looking in the right fields and ignoring the wrong ones. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 18.05.23 00:59, Jeff Davis wrote: On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote: Other than that, and I took your suggestions almost verbatim. Patch attached. Thank you! Attached new patch with a typo fix and a few other edits. I plan to commit soon. Some small follow-up on this patch: Please put blank lines between etc., matching existing style. We usually don't capitalize the collation parameters like CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP); elsewhere in the documentation. Table 24.2. ICU Collation Settings should probably be sorted by key, or at least by something. All tables should referenced in the text, like "Table x.y shows this and that." (Note that a table could float to a different page in some output formats, so just putting it into a section without some introductory text isn't sound.) Table 24.1. ICU Collation Levels shows punctuation as level 4, which is only true in shifted mode, which isn't the default. The whole business of treating variable collation elements is getting a bit lost in this description. The kv option is described as "Classes of characters ignored during comparison at level 3.", which is effectively true but not the whole picture.
Re: Order changes in PG16 since ICU introduction
On 18.05.23 19:55, Jeff Davis wrote: On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote: I did a quicker read through this time. LGTM overall. I like what you did with the explanations around sensitivity (now it makes sense). Committed, thank you. There are a few things I don't understand that would be good to document better: * Rules. I still don't quite understand the use case: are these for people inventing new languages? What is a plausible use case that isn't covered by the existing locales and collation settings? Do rules make sense for a database default collation? Are they for language experts only or might an ordinary developer benefit from using them? The rules are for setting whatever sort order you like. Maybe you want to sort + before - or whatever. It's like, if you don't like it, build your own. * The collation types "phonebk", "emoji", etc.: are these variants of particular locales, or do they make sense in multiple locales? I don't know where they fit in or how to document them. The k* settings are parametric settings, in that they transform the sort key in some algorithmic way. The co settings are just everything else. They are not parametric, they are just some other sort order that someone spelled out explicitly. * I don't understand what "kc" means if "ks" is not set to "level1". There is an example here: https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings#colcaselevel
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-05-19 at 21:13 +0200, Daniel Verite wrote: > ISTM that if we want to go that route, we need the make the minimum > changes at the user interface level and not any deeper, so that when > (locale="C" OR locale="POSIX") AND the provider has not been > specified, > then the command (initdb and create database) act as if the user had > specified provider=libc. If we special case locale=C, but do nothing for locale=fr_FR, then I'm not sure we've solved the problem. Andrew Gierth raised the issue here, which he called "maximally confusing": https://postgr.es/m/874jp9f5jo@news-spur.riddles.org.uk That's why I feel that we need to make locale apply to whatever the provider is, not just when it happens to be C. > > (3) Support iculocale=C in the ICU provider using the memcmp() > > path. > > > In other words, if provider=icu and iculocale=C, lc_collate_is_c() > > and > > lc_ctpye_is_c() would both return true. > > ICU does not provide a locale that behaves like that, and it doesn't > feel right to pretend it does. It feels like attacking the problem > at the wrong level. I agree that #3 feels slightly wrong, but I think it's still a viable option until we have consensus on something better. > > (4) Create a new "none" provider (which has no locale and always > > memcmp > > semantics), and automatically change the provider to "none" if > > provider=icu and iculocale=C. > > It still uses libc/C for character classification and case changing, > so "no locale" is technically not true. The provider affects callers that have a pg_locale_t, such as the SQL- callable lower() function. For those callers, the "none" provider uses pg_ascii_tolower(), etc., not libc. That's why I called it "none" -- it's using simple internal postgres implementations instead of a provider. For callers that don't have a pg_locale_t, they may call libc functions directly and rely on the server environment. But in those cases, there's no way to set a provider at all, it's just relying on the server environment. There aren't many of these cases, and hopefully we can eliminate the reliance on the server environment over time. If I'm missing something, let me know what cases you have in mind. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
Jeff Davis writes: > Committed, thank you. This commit has given the PDF docs build some indigestion: Making portrait pages on A4 paper (210mmx297mm) /home/postgres/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf [WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting with "Symbol,normal,400". [WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. Substituting with "ZapfDingbats,normal,400". [WARN] FOUserAgent - Hyphenation pattern not found. URI: en. [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by 3531 millipoints. (See position 55117:2388) [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by 1871 millipoints. (See position 55117:12998) [WARN] FOUserAgent - Glyph "?" (0x323, dotbelowcmb) not available in font "Courier". [WARN] FOUserAgent - Glyph "?" (0x302, circumflexcmb) not available in font "Courier". [WARN] FOUserAgent - The contents of fo:block line 12 exceed the available area in the inline-progression direction by 20182 millipoints. (See position 55172:188) [WARN] FOUserAgent - The contents of fo:block line 10 exceed the available area in the inline-progression direction by 17682 millipoints. (See position 55172:188) [WARN] FOUserAgent - Glyph "?" (0x142, lslash) not available in font "Times-Roman". [WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value found on the parent FO. (The first three and last one warnings are things we've been living with, but the ones between are new.) The first two "exceed the available area" complaints are in the "ICU Collation Levels" table. We can silence them by providing some column width hints to make the "Description" column a tad wider than the rest, as in the proposed patch attached. The other two, as well as the first two glyph-not-available complaints, are caused by this bit: Full normalization is important in some cases, such as when multiple accents are applied to a single character. For instance, 'ệ' can be composed of code points U&'\0065\0323\0302' or U&'\0065\0302\0323'. With full normalization on, these code point sequences are treated as equal; otherwise they are unequal. which renders just abysmally (see attached screen shot), and even in HTML where it's rendering about as intended, it really is an unintelligible example. Few people are going to understand that the circumflex and the dot-below are separately applied accents. How about we drop the explicit example and write something like Full normalization allows code point sequences such as characters with multiple accent marks applied in different orders to be seen as equal. ? (The last missing-glyph complaint is evidently from the release notes; I'll bug Bruce about that separately.) regards, tom lane diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 9db14649aa..96a23bf530 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -1140,6 +1140,14 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true ICU Collation Levels + + + + + + + + Level
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > 2) Automatically change the provider to libc when locale=C. > > Almost works, but it's not clear how we handle the case "provider=icu > lc_collate='fr_FR.utf8' locale=C". > > If we change it to "provider=libc lc_collate=C", we've overridden the > specified lc_collate. If we ignore the locale=C, that would be > surprising to users. If we throw an error, that would be a backwards > compatibility issue. This thread started with a report illustrating that when users mention the locale "C", they implicitly mean "C" from the libc provider, as when libc was the default. The problem is that as soon as ICU is the default, any reference to a libc collation should mention explicitly that the provider is libc. It seems what we're set on the idea to create an exception for "C" (and I assume also "POSIX") to avoid too much confusion, and because "C" is quite special anyway, and has no equivalent in ICU (the switch in v16 to ICU as the default provider is based on the premise that the locales with the same name will behave pretty much the same with ICU as they did with libc, but it's absolutely not the case with "C"). ISTM that if we want to go that route, we need the make the minimum changes at the user interface level and not any deeper, so that when (locale="C" OR locale="POSIX") AND the provider has not been specified, then the command (initdb and create database) act as if the user had specified provider=libc. > (3) Support iculocale=C in the ICU provider using the memcmp() path. > In other words, if provider=icu and iculocale=C, lc_collate_is_c() and > lc_ctpye_is_c() would both return true. ICU does not provide a locale that behaves like that, and it doesn't feel right to pretend it does. It feels like attacking the problem at the wrong level. > (4) Create a new "none" provider (which has no locale and always memcmp > semantics), and automatically change the provider to "none" if > provider=icu and iculocale=C. It still uses libc/C for character classification and case changing, so "no locale" is technically not true. Personally I don't see the benefit of adding a "none" provider. C is a libc locale and libc is not disappearing. I also think that when users explicitly indicate provider=icu, they should get icu. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
On Thu, 2023-05-18 at 20:11 +0200, Matthias van de Meent wrote: > As I complain about in [0], since 5cd1a5af --no-locale has been > broken > / bahiving outside it's description: Instead of being equivalent to > `--locale=C` it now also overrides `--locale-provider=libc`, > resulting > in undocumented behaviour. I agree that 5cd1a5af is incomplete. Posting updated patches. Feedback on the approaches below would be appreciated. For context, in version 15: $ initdb -D data --locale-provider=icu --icu-locale=en => create database clocale template template0 locale='C'; => select datname, datlocprovider, daticulocale from pg_database where datname='clocale'; datname | datlocprovider | daticulocale -++-- clocale | i | en (1 row) That behavior is confusing, and when I made ICU the default provider in v16, the confusion was extended into more cases. If we leave the CREATE DATABASE (and createdb and initdb) syntax in place, such that LOCALE (and --locale) do not apply to ICU at all, then I don't see a path to a good ICU user experience. Therefore I conclude that we need LOCALE (and --locale) to apply to ICU somehow. (The LOCALE option already applies to ICU during CREATE COLLATION, just not CREATE DATABASE or initdb.) Patch 0003 does this. It's fairly straightforward and I believe we need this patch. But to actually fix your complaint we also need --no-locale to be equivalent to --locale=C and for those options to both use memcmp() semantics. There are several approaches to accomplish this, and I think this is the part where I most need some feedback. There are only so many approaches, and each one has some potential downsides, but I believe we need to select one: (1) Give up and leave the existing CREATE DATABASE (and createdb, and initdb) semantics in place, along with the confusing behavior in v15. This is a last resort, in my opinion. It gives us no path toward a good user experience with ICU, and leaves us with all of the problems of the OS as a collation provider. (2) Automatically change the provider to libc when locale=C. Almost works, but it's not clear how we handle the case "provider=icu lc_collate='fr_FR.utf8' locale=C". If we change it to "provider=libc lc_collate=C", we've overridden the specified lc_collate. If we ignore the locale=C, that would be surprising to users. If we throw an error, that would be a backwards compatibility issue. One possible solution would be to change the catalog representation to allow setting the default collation locale separately from datcollate even for the libc provider. For instance, rename daticulocale to datdeflocale, and store the default collation locale there for both libc and ICU. Then, "provider=icu lc_collate='fr_FR.utf8' locale=C" could be changed into "provider=libc lc_collate='fr_FR.utf8' deflocale=C". It may be confusing that datcollate is a different concept from datdeflocale; but then again they are different concepts and it's confusing that they are currently combined into one. (3) Support iculocale=C in the ICU provider using the memcmp() path. In other words, if provider=icu and iculocale=C, lc_collate_is_c() and lc_ctpye_is_c() would both return true. There's a potential problem for users who've misused ICU in the past (15 or earlier) by using provider=icu and iculocale=C. ICU would accept such a locale name, but not recognize it and fall back to the root locale, so it never worked as the user intended it. But if we redefine C to be memcmp(), then such users will have broken indexes if they upgrade. We could add a check at pg_upgrade time for iculocale=C in versions 15 and earlier, and cause the check (and therefore the upgrade) to fail. That may be reasonable considering that it never really worked in the past, and perhaps very few users actually ever created such a collation. But if some user runs into that problem, we'd have to resort to a hack like telling them to "update pg_collation set iculocale='und' where iculocale='C'" and then try the upgrade again, which is not a great answer (as far as I can tell it would be a correct answer and should not break their indexes, but it feels pretty dangerous). There may be some other resolutions to this problem, such as catalog hacks that allow for different representations of iculocale=C pre-16 and post-16. That doesn't sound great though, and we'd have to figure out what to do with pg_dump. (4) Create a new "none" provider (which has no locale and always memcmp semantics), and automatically change the provider to "none" if provider=icu and iculocale=C. This solves the problem case in #2 and the potential upgrade problem in #3. It also makes the documentation a bit more natural, in my opinion, even if we retain the special case for provider=libc collate=C. #4 is the approach I chose (patches 0001 and 0002), but I'd like to hear what others think. For historical reasons, users may assume that LC_COLLATE control
Re: Order changes in PG16 since ICU introduction
On Thu, 2023-05-18 at 13:58 -0400, Jonathan S. Katz wrote: > From my read of them, as an app developer I'd be very unlikely to > use > this. Maybe there is something with building out some collation rules > vis-a-vis an extension, but I have trouble imagining the use-case. I > may > also not be the target audience for this feature. That's a problem for the ICU rules feature. I understand some features may be for domain experts only, but we at least need to call that out so that ordinary developers don't get confused. And we should hear from some of those domain experts that they actually want it and it solves a real problem. For the features that can be described with collation settings/attributes right in the locale name, the use cases are more plausible and we've supported them since v10, so it's good to document them as best we can. It's hard to expose only the particular ICU collation settings we understand best (e.g. the "ks" setting that allows case insensitive collation), so it's inevitable that there will be some settings that are more obscure and harder to document. But in the case of ICU rules, they are newly-supported in 16, so there should be a clear reason we're adding them. Otherwise we're just setting up users for confusion or problems, and creating backwards- compatibility headaches for ourselves (and the last thing we want is to fret over backwards compatibility for a feature with no users). Beyond that, there seems to be some danger: if the syntax for rules is not perfectly compatible between ICU versions, the user might run into big problems. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Fri, 21 Apr 2023 at 22:46, Jeff Davis wrote: > > On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote: > > > > > > > > Also, somewhere along the line someone broke initdb --no-locale, > > which > > should result in C locale being the default everywhere, but when I > > just > > tested it it picked 'en' for an ICU locale, which is not the right > > thing. > > Fixed, thank you. As I complain about in [0], since 5cd1a5af --no-locale has been broken / bahiving outside it's description: Instead of being equivalent to `--locale=C` it now also overrides `--locale-provider=libc`, resulting in undocumented behaviour. Kind regards, Matthias van de Meent Neon, Inc. [0] https://www.postgresql.org/message-id/CAEze2WiZFQyyb-DcKwayUmE4rY42Bo6kuK9nBjvqRHYxUYJ-DA%40mail.gmail.com
Re: Order changes in PG16 since ICU introduction
On 5/18/23 1:55 PM, Jeff Davis wrote: On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote: I did a quicker read through this time. LGTM overall. I like what you did with the explanations around sensitivity (now it makes sense). Committed, thank you. \o/ There are a few things I don't understand that would be good to document better: * Rules. I still don't quite understand the use case: are these for people inventing new languages? What is a plausible use case that isn't covered by the existing locales and collation settings? Do rules make sense for a database default collation? Are they for language experts only or might an ordinary developer benefit from using them? From my read of them, as an app developer I'd be very unlikely to use this. Maybe there is something with building out some collation rules vis-a-vis an extension, but I have trouble imagining the use-case. I may also not be the target audience for this feature. * The collation types "phonebk", "emoji", etc.: are these variants of particular locales, or do they make sense in multiple locales? I don't know where they fit in or how to document them. I remember I had a exploratory use case for "phonebk" but I couldn't figure out how to get it to work. AIUI from random searching, the idea is that it provides the "phonebook" rules for ordering "names" in a particular locale, but I couldn't get it to work. * I don't understand what "kc" means if "ks" is not set to "level1". Me neither, but I haven't stared at this as hard as others. Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: Order changes in PG16 since ICU introduction
On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote: > I did a quicker read through this time. LGTM overall. I like what you > did with the explanations around sensitivity (now it makes sense). Committed, thank you. There are a few things I don't understand that would be good to document better: * Rules. I still don't quite understand the use case: are these for people inventing new languages? What is a plausible use case that isn't covered by the existing locales and collation settings? Do rules make sense for a database default collation? Are they for language experts only or might an ordinary developer benefit from using them? * The collation types "phonebk", "emoji", etc.: are these variants of particular locales, or do they make sense in multiple locales? I don't know where they fit in or how to document them. * I don't understand what "kc" means if "ks" is not set to "level1". Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 5/17/23 6:59 PM, Jeff Davis wrote: On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote: Other than that, and I took your suggestions almost verbatim. Patch attached. Thank you! Attached new patch with a typo fix and a few other edits. I plan to commit soon. I did a quicker read through this time. LGTM overall. I like what you did with the explanations around sensitivity (now it makes sense). Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: Order changes in PG16 since ICU introduction
On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote: > Other than that, and I took your suggestions almost verbatim. Patch > attached. Thank you! Attached new patch with a typo fix and a few other edits. I plan to commit soon. Regards, Jeff Davis From d0d2375fa55618b60f361f6bb64b2c49490125b9 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 27 Apr 2023 14:43:46 -0700 Subject: [PATCH] Doc improvements for language tags and custom ICU collations. Separate the documentation for language tags themselves from the available collation settings which can be included in a language tag. Include tables of the available options, more details about the effects of each option, and additional examples. Also include an explanation of the "levels" of textual features and how they relate to collation. Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1...@postgresql.org Reviewed-by: Jonathan S. Katz --- doc/src/sgml/charset.sgml | 683 +++--- 1 file changed, 562 insertions(+), 121 deletions(-) diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 6dd95b8966..6b9c323edd 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en variants and customization options. + + ICU Locales + +ICU Locale Names + + The ICU format for the locale name is a Language Tag. + + +CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP); +CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr'); + + + + +Locale Canonicalization and Validation + + When defining a new ICU collation object or database with ICU as the + provider, the given locale name is transformed ("canonicalized") into a + language tag if not already in that form. For instance, + + +CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true'); +NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true" +CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8'); +NOTICE: using standard form "de-DE" for locale "de_DE.utf8" + + + If you see this notice, ensure that the PROVIDER and + LOCALE are the expected result. For consistent results + when using the ICU provider, specify the canonical language tag instead of relying on the + transformation. + + + A locale with no language name, or the special language name + root, is transformed to have the language + und ("undefined"). + + + ICU can transform most libc locale names, as well as some other formats, + into language tags for easier transition to ICU. If a libc locale name is + used in ICU, it may not have precisely the same behavior as in libc. + + + If there is a problem interpreting the locale name, or if the locale name + represents a language or region that ICU does not recognize, you will see + the following warning: + + +CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense'); +WARNING: ICU locale "nonsense" has unknown language "nonsense" +HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. +CREATE COLLATION + + + controls how the message is + reported. Unless set to ERROR, the collation will + still be created, but the behavior may not be what the user intended. + + + +Language Tag + + A language tag, defined in BCP 47, is a standardized identifier used to + identify languages, regions, and other information about a locale. + + + Basic language tags are simply + language-region; + or even just language. The + language is a language code + (e.g. fr for French), and + region is a region code + (e.g. CA for Canada). Examples: + ja-JP, de, or + fr-CA. + + + Collation settings may be included in the language tag to customize + collation behavior. ICU allows extensive customization, such as + sensitivity (or insensitivity) to accents, case, and punctuation; + treatment of digits within text; and many other options to satisfy a + variety of uses. + + + To include this additional collation information in a language tag, + append -u, which indicates there are additional + collation settings, followed by one or more + -key-value + pairs. The key is the key for a collation setting and + value is a valid value for that setting. For + boolean settings, the -key + may be specified without a corresponding + -value, which implies a + value of true. + + + For example, the language tag en-US-u-kn-ks-level2 + means the locale with the English language in the US region, with + collation settings kn set to true + and ks set to level2. Those + settings mean the collation will be case-insensitive and treat a sequence + of digits as a single number: + +CREATE COLL
Re: Order changes in PG16 since ICU introduction
On Tue, 2023-05-16 at 15:35 -0400, Jonathan S. Katz wrote: > + Sensitivity when determining equality, with > + level1 the least sensitive and > + identic the most sensitive. See + linkend="icu-collation-levels"/> for details. > > This discusses equality sensitivity, but I'm not sure if I understand > that term here. The ICU docs seem to call these "strengths"[1], maybe > we > use that term to be consistent with upstream? "Sensitivity" comes from "case sensitivity" which is more clear to me than "strength". I added the term "strength" to correspond to the unicode terminology, but I kept sensitivity and I tried to make it slightly more clear. Other than that, and I took your suggestions almost verbatim. Patch attached. Thank you! I also made a few other changes: * added paragraph transformation of '' or 'root' to the 'und' language (root collation) * added paragraph that the "identic" level still performs some basic normalization * added example for when full normalization matters I should also say that I don't really understand the case when "kc" is set to true and "ks" is level 2 or higher. If someone has an example of where that matters, let me know. Regards, Jeff Davis From 8633ec205b0b0297910cef8f931092d0c05eb3ce Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 27 Apr 2023 14:43:46 -0700 Subject: [PATCH v2] Doc improvements for language tags and custom ICU collations. Separate the documentation for language tags themselves from the available collation settings which can be included in a language tag. Include tables of the available options, more details about the effects of each option, and additional examples. Also include an explanation of the "levels" of textual features and how they relate to collation. Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1...@postgresql.org Reviewed-by: Jonathan S. Katz --- doc/src/sgml/charset.sgml | 680 +++--- 1 file changed, 559 insertions(+), 121 deletions(-) diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 6dd95b8966..ea43732ec9 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en variants and customization options. + + ICU Locales + +ICU Locale Names + + The ICU format for the locale name is a Language Tag. + + +CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP); +CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr'); + + + + +Locale Canonicalization and Validation + + When defining a new ICU collation object or database with ICU as the + provider, the given locale name is transformed ("canonicalized") into a + language tag if not already in that form. For instance, + + +CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true'); +NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true" +CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8'); +NOTICE: using standard form "de-DE" for locale "de_DE.utf8" + + + If you see this notice, ensure that the PROVIDER and + LOCALE are the expected result. For consistent results + when using the ICU provider, specify the canonical language tag instead of relying on the + transformation. + + + A locale with no language name, or the special language name + root, is transformed to have the language + und ("undefined"). + + + ICU can transform most libc locale names, as well as some other formats, + into language tags for easier transition to ICU. If a libc locale name is + used in ICU, it may not have precisely the same behavior as in libc. + + + If there is a problem interpreting the locale name, or if the locale name + represents a language or region that ICU does not recognize, you will see + the following error: + + +CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense'); +ERROR: ICU locale "nonsense" has unknown language "nonsense" +HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. + + + controls how the message is + reported. If set below ERROR, the collation will still + be created, but the behavior may not be what the user intended. + + + +Language Tag + + A language tag, defined in BCP 47, is a standardized identifier used to + identify languages, regions, and other information about a locale. + + + Basic language tags are simply + language-region; + or even just language. The + language is a language code + (e.g. fr for French), and + region is a region code + (e.g. CA for Canada). Examples: + ja-JP, de, or + fr-CA. + + + Collation settings may be included in the language tag to customize + collation behavior. ICU allows extensive customization, s
Re: Order changes in PG16 since ICU introduction
On 5/5/23 8:25 PM, Jeff Davis wrote: On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote: On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis wrote: Most of the complaints seem to be complaints about v15 as well, and while those complaints may be a reason to not make ICU the default, they are also an argument that we should continue to learn and try to fix those issues because they exist in an already-released version. Leaving it the default for now will help us fix those issues rather than hide them. It's still early, so we have plenty of time to revert the initdb default if we need to. That's fair enough, but I really think it's important that some energy get invested in providing adequate documentation for this stuff. Just patching the code is not enough. Attached a significant documentation patch. I tried to make it comprehensive without trying to be exhaustive, and I separated the explanation of language tags from what collation settings you can include in a language tag, so hopefully that's more clear. I added quite a few examples spread throughout the various sections, and I preserved the existing examples at the end. I also left all of the external links at the bottom for those interested enough to go beyond what's there. [Personal hat, not RMT] Thanks -- this is super helpful. A bunch of these examples I had previously had to figure out by randomly searching blog posts / trial-and-error, so I think this will help developers get started more quickly. Comments (and a lot are just little nits to tighten the language) Commit message -- typo: "documentaiton" + If you see such a message, ensure that the PROVIDER and + LOCALE are as you expect, and consider specifying + directly as the canonical language tag instead of relying on the + transformation. + I'd recommend make this more prescriptive: "If you see this notice, ensure that the PROVIDER and LOCALE are the expected result. For consistent results when using the ICU provider, specify the canonical linkend="icu-language-tag">language tag instead of relying on the transformation." + If there is some problem interpreting the locale name, or if it represents + a language or region that ICU does not recognize, a message will be reported: This is passive voice, consider: "If there is a problem interpreting the locale name, or if the locale name represents a language or region that ICU does not recognize, you'll see the following error:" + +Language Tag + Before jumping in, I'd recommend a quick definition of what a language tag is, e.g.: "A language tag, defined in BCP 47, is a standardized identifier used to identify languages in computer systems" or something similar. (I did find a database that made it simpler to search for these, which is one issue I've previously add, but I don't think we'd want to link to i) + To include this additional collation information in a language tag, + append -u, followed by one or more My first question was "what's special about '-u'", so maybe we say: "To include this additional collation information in a language tag, append -u, which indicates there are additional collation settings, followed by one or more..." + ICU locales are specified as a linkend="icu-language-tag">Language + Tag, but can also accept most libc-style locale names (which will + be transformed into language tags if possible). + I'd recommend removing the parantheticals: ICU locales are specified as a BCP 47 linkend="icu-language-tag">Language Tag, but can also accept most libc-style locale names. If possible, libc-style locale names are transformed into language tags. + ICU Collation Levels Nothing to add here other than to say I'm extremely appreciative of this section. Once upon a time I sunk a lot of time trying to figure out how all of these levels worked. + Sensitivity when determining equality, with + level1 the least sensitive and + identic the most sensitive. See for details. This discusses equality sensitivity, but I'm not sure if I understand that term here. The ICU docs seem to call these "strengths"[1], maybe we use that term to be consistent with upstream? + If set to upper, upper case sorts before lower + case. If set to lower, lower case sorts before + upper case. If set to false, it depends on the + locale. Suggestion to tighten this up: "If set to false, the sort depends on the rules of the locale." + Defaults may depend on locale. The above table is not meant to be + complete. See for additinal + options and details. Typo: additinal => "additional" I didn't add additional documentation for ICU rules. There are so many options for collations that it's hard for me to think of realistic examples to specify the rules directly, unless someone wants to invent a new language. Perhaps useful if working with an interes
Re: Order changes in PG16 since ICU introduction
On Tue, 2023-05-16 at 19:00 +0300, Alexander Lakhin wrote: > I'm not sure about the proposed change in icu_from_uchar(). It seems > that > len_result + 1 bytes should always be enough for the result string > terminated > with NUL. If that's not true (we want to protect from some ICU bug > here), > then the change should be backpatched? I believe it's enough and I'm not aware of any bug there so no backport is required. I added checks in places that were (a) checking for U_FAILURE; and (b) expecting the result to be NUL-terminated. That's mostly callers of uloc_getLanguage(), where I was not quite paranoid enough. There were a couple other places though, and I went ahead and added checks there out of paranoia, too. One was ucnv_fromUChars(), and the other was uloc_canonicalize(). Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
Hi Jeff, 16.05.2023 00:03, Jeff Davis wrote: On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote: On the current master (after 455f948b0, and before f7faa9976, of course) I get an ASAN-detected failure with the following query: CREATE COLLATION col (provider = icu, locale = '123456789012'); Thank you for the report! ICU source specifically says: /** * Useful constant for the maximum size of the language part of a locale ID. * (including the terminating NULL). * @stable ICU 2.0 */ #define ULOC_LANG_CAPACITY 12 So the fact that it returning success without nul-terminating the result is an ICU bug in my opinion, and I reported it here: https://unicode-org.atlassian.net/browse/ICU-22394 Unfortunately that means we need to be a bit more paranoid and always check for that warning, even if we have a preallocated buffer of the "correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially scary), unless we check for those errors each time and report specific errors for them. Another approach is to always check the length and loop using dynamic allocation so that we never overflow (and forget about any static buffers). That seems like overkill given that the problem case is obviously invalid input; I think as long as we catch it and throw an ERROR it's fine. But I can do this if others think it's worthwhile. Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING in a few places and turns it into an ERROR. It also cleans up the loop for uloc_getLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING rather than (U_SUCCESS(status) && len >= buflen). I'm not sure about the proposed change in icu_from_uchar(). It seems that len_result + 1 bytes should always be enough for the result string terminated with NUL. If that's not true (we want to protect from some ICU bug here), then the change should be backpatched? Best regards, Alexander
Re: Order changes in PG16 since ICU introduction
On Mon, 2023-05-08 at 14:59 -0700, Jeff Davis wrote: > The easiest thing to do is revert it for now, and after we sort out > the > memcmp() path for the ICU provider, then I can commit it again (after > that point it would just be code cleanup and should have no > functional > impact). The conversion won't be entirely dead code even after we handle the "C" locale with memcmp(): for a locale like "C.UTF-8", it will still be passed to the collation provider (same as with libc), and in that case, we should still convert that to a language tag consistently across ICU versions. For it to be entirely dead code, we would need to convert any locale with language "C" (e.g. "C.UTF-8") to use the memcmp() path. I'm fine with that, but that's not what the libc provider does today, and perhaps we should be consistent between the two. If we do leave the code in place, we can document that specific "en-US-u-va-posix" locale so that it's not too surprising for users. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote: > On the current master (after 455f948b0, and before f7faa9976, of > course) > I get an ASAN-detected failure with the following query: > CREATE COLLATION col (provider = icu, locale = '123456789012'); > Thank you for the report! ICU source specifically says: /** * Useful constant for the maximum size of the language part of a locale ID. * (including the terminating NULL). * @stable ICU 2.0 */ #define ULOC_LANG_CAPACITY 12 So the fact that it returning success without nul-terminating the result is an ICU bug in my opinion, and I reported it here: https://unicode-org.atlassian.net/browse/ICU-22394 Unfortunately that means we need to be a bit more paranoid and always check for that warning, even if we have a preallocated buffer of the "correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially scary), unless we check for those errors each time and report specific errors for them. Another approach is to always check the length and loop using dynamic allocation so that we never overflow (and forget about any static buffers). That seems like overkill given that the problem case is obviously invalid input; I think as long as we catch it and throw an ERROR it's fine. But I can do this if others think it's worthwhile. Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING in a few places and turns it into an ERROR. It also cleans up the loop for uloc_getLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING rather than (U_SUCCESS(status) && len >= buflen). -- Jeff Davis PostgreSQL Contributor Team - AWS From 9c8e9272ca807c9f75a7b32fa3190700cc600260 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Mon, 15 May 2023 13:35:07 -0700 Subject: [PATCH] ICU: check for U_STRING_NOT_TERMINATED_WARNING. In some cases, ICU can fail to NUL-terminate a result string even if using an appropriately-sized static buffer. The caller must either check for len >= buflen or U_STRING_NOT_TERMINATED_WARNING. The specific problem is related to uloc_getLanguage(), but add the check in other cases as well. Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/2098874d-c111-41e4-9063-30bcf1352...@gmail.com --- src/backend/utils/adt/pg_locale.c | 29 +++-- src/bin/initdb/initdb.c | 15 --- 2 files changed, 15 insertions(+), 29 deletions(-) diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index f0b6567da1..1cf93b2d20 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -2468,7 +2468,7 @@ pg_ucol_open(const char *loc_str) status = U_ZERO_ERROR; uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) + if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING) { ereport(ERROR, (errmsg("could not get language from locale \"%s\": %s", @@ -2504,7 +2504,7 @@ pg_ucol_open(const char *loc_str) * Pretend the error came from ucol_open(), for consistent error * message across ICU versions. */ - if (U_FAILURE(status)) + if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING) { ucol_close(collator); ereport(ERROR, @@ -2639,7 +2639,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar) status = U_ZERO_ERROR; len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1, buff_uchar, len_uchar, &status); - if (U_FAILURE(status)) + if (U_FAILURE(status) || + status == U_STRING_NOT_TERMINATED_WARNING) ereport(ERROR, (errmsg("%s failed: %s", "ucnv_fromUChars", u_errorName(status; @@ -2681,7 +2682,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc, icu_locale_id = palloc(len + 1); *status = U_ZERO_ERROR; len = uloc_canonicalize(loc, icu_locale_id, len + 1, status); - if (U_FAILURE(*status)) + if (U_FAILURE(*status) || *status == U_STRING_NOT_TERMINATED_WARNING) return; lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id)); @@ -2765,7 +2766,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc, pfree(lower_str); } - #endif /* @@ -2789,7 +2789,7 @@ icu_language_tag(const char *loc_str, int elevel) status = U_ZERO_ERROR; uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) + if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING) { if (elevel > 0) ereport(elevel, @@ -2811,19 +2811,12 @@ icu_language_tag(const char *loc_str, int elevel) langtag = palloc(buflen); while (true) { - int32_t len; - status = U_ZERO_ERROR; - len = uloc_toLanguageTag(loc_str, langtag, buf
Re: Order changes in PG16 since ICU introduction
On 11.05.23 23:29, Jeff Davis wrote: New patch series attached. === 0001: fix bug that allows creating hidden collations Bug: https://www.postgresql.org/message-id/051c9395cf880307865ee8b17acdbf7f838c1e39.ca...@j-davis.com This is still being debated in the other thread. Not really related to this thread, so I suggest dropping it from this patch series. === 0002: handle some kinds of libc-stlye locale strings ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't in later versions. Handle them in postgres for consistency. I tend to agree with ICU that these variants are obsolete, and we don't need to support them anymore. If this were a tiny patch, then maybe ok, but the way it's presented here the whole code is duplicated between pg_locale.c and initdb.c, which is not great. === 0003: reduce icu_validation_level to WARNING Given that we've seen some inconsistency in which locale names are accepted in different ICU versions, it seems best not to be too strict. Peter Eisentraut suggested that it be set to ERROR originally, but a WARNING should be sufficient to see problems without introducing risks migrating to version 16. I'm not sure why this is the conclusion. Presumably, the detection capabilities of ICU improve over time, so we want to take advantage of that? What are some example scenarios where this change would help? === 0004-0006: To solve the issues that have come up in this thread, we need CREATE DATABASE (and createdb and initdb) to use LOCALE to mean the collation locale regardless of which provider is in use (which is what 0006 does). 0006 depends on ICU handling libc locale names. It already does a good job for most libc locale names (though patch 0002 fixes a few cases where it doesn't). There may be more cases, but for the most part libc names are interpreted in a reasonable way. But one important case is missing: ICU does not handle the "C" locale as we expect (that is, using memcmp()). We've already allowed users to create ICU collations with the C locale in the past, which uses the root collation (not memcmp()), and we need to keep supporting that for upgraded clusters. I'm not sure I agree that we need to keep supporting that. The only way you could get that in past releases is if you specify explicitly, "give me provider ICU and locale C", and then it wouldn't actually even work correctly. So nobody should be using that in practice, and nobody should have stumbled into that combination of settings by accident. 3. Introduce collation provider "none", which is always memcmp-based (patch 0004). It's equivalent to the libc locale=C, but it allows specifying the LC_COLLATE and LC_CTYPE independently. A command like CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C' LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none" (patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For the database default collation, that would always use memcmp(), but the server environment LC_COLLATE would be set to en_US as the user specified. This seems most attractive, but I think it's quite invasive at this point, especially given the dubious premise (see above). === 0007: Add a GUC to control the default collation provider Having a GUC would make it easier to migrate to ICU without surprises. This only affects the default for CREATE COLLATION, not CREATE DATABASE (and obviously not initdb). It's not clear to me why we would want that. Also not clear why it should only affect CREATE COLLATION.
Re: Order changes in PG16 since ICU introduction
Hello Jeff, 09.05.2023 00:59, Jeff Davis wrote: The easiest thing to do is revert it for now, and after we sort out the memcmp() path for the ICU provider, then I can commit it again (after that point it would just be code cleanup and should have no functional impact). On the current master (after 455f948b0, and before f7faa9976, of course) I get an ASAN-detected failure with the following query: CREATE COLLATION col (provider = icu, locale = '123456789012'); ==2929883==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc491be09c at pc 0x556e8571a260 bp 0x7 ffc491be020 sp 0x7ffc491bd7c8 READ of size 15 at 0x7ffc491be09c thread T0 #0 0x556e8571a25f in __interceptor_strcmp.part.0 (.../usr/local/pgsql/bin/postgres+0x2aa025f) #1 0x556e86d77ee6 in icu_language_tag .../src/backend/utils/adt/pg_locale.c:2802 ... Address 0x7ffc491be09c is located in stack of thread T0 at offset 76 in frame #0 0x556e86d77cfe in icu_language_tag .../src/backend/utils/adt/pg_locale.c:2782 This frame has 2 object(s): [48, 52) 'status' (line 2784) [64, 76) 'lang' (line 2785) <== Memory access at offset 76 overflows this variable ... Here, uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status) returns status = -124, i.e., U_STRING_NOT_TERMINATED_WARNING = -124,/**< An output string could not be NUL-terminated because output length==destCapacity. */ (ULOC_LANG_CAPACITY = 12) this value is not covered by U_FAILURE(status), and strcmp(), that follows, goes out of the lang variable bounds. Best regards, Alexander
Re: Order changes in PG16 since ICU introduction
New patch series attached. === 0001: fix bug that allows creating hidden collations Bug: https://www.postgresql.org/message-id/051c9395cf880307865ee8b17acdbf7f838c1e39.ca...@j-davis.com === 0002: handle some kinds of libc-stlye locale strings ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't in later versions. Handle them in postgres for consistency. === 0003: reduce icu_validation_level to WARNING Given that we've seen some inconsistency in which locale names are accepted in different ICU versions, it seems best not to be too strict. Peter Eisentraut suggested that it be set to ERROR originally, but a WARNING should be sufficient to see problems without introducing risks migrating to version 16. I don't expect objections to 0003, so I may commit this soon, but I'll give it a little time in case someone has an opinion. === 0004-0006: To solve the issues that have come up in this thread, we need CREATE DATABASE (and createdb and initdb) to use LOCALE to mean the collation locale regardless of which provider is in use (which is what 0006 does). 0006 depends on ICU handling libc locale names. It already does a good job for most libc locale names (though patch 0002 fixes a few cases where it doesn't). There may be more cases, but for the most part libc names are interpreted in a reasonable way. But one important case is missing: ICU does not handle the "C" locale as we expect (that is, using memcmp()). We've already allowed users to create ICU collations with the C locale in the past, which uses the root collation (not memcmp()), and we need to keep supporting that for upgraded clusters. So that leaves us with a catalog representation problem. I mentioned upthread that we can solve that by: 1. Using iculocale=NULL to mean "C-as-in-memcmp", or having some other catalog hack (like another field). That's not desirable because the catalog representation is already complex and it may be hard for users to tell what's happening. 2. When provider=icu and locale=C, switch to provider=libc locale=C. This is very messy, because currently the syntax allows specifying a database with LOCALE_PROVIDER='icu' ICU_LOCALE='C' LC_COLLATE='en_US' - - if the provider gets changed to libc, what would we set datcollate to? I don't think this is workable without some breakage. We can't simply override datcollate to be C in that case, because there are some things other than the default collation that might need it set to en_US as the user specified. 3. Introduce collation provider "none", which is always memcmp-based (patch 0004). It's equivalent to the libc locale=C, but it allows specifying the LC_COLLATE and LC_CTYPE independently. A command like CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C' LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none" (patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For the database default collation, that would always use memcmp(), but the server environment LC_COLLATE would be set to en_US as the user specified. For this patch series, I chose approach #3. I think it works out nicely -- it provides a better place to document the "no locale" behavior (including a warning that it depends on the database encoding), and I think it's more clear to the user that locale=C is not actually using a provider at all. It's more invasive, but feels like a better solution. If others don't like it I can implement approach #1 instead. === 0007: Add a GUC to control the default collation provider Having a GUC would make it easier to migrate to ICU without surprises. This only affects the default for CREATE COLLATION, not CREATE DATABASE (and obviously not initdb). -- Jeff Davis PostgreSQL Contributor Team - AWS From fc66f02976bb11b629bcf71346c2858eccbcf1a3 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 11 May 2023 10:36:04 -0700 Subject: [PATCH v5 1/7] For user-defined collations, never set collencoding=-1. For new user-defined collations, always set collencoding to the current database encoding so that it is never shadowed by a built-in collation. Built in collations that work with any encoding may have collencoding=-1, and if a user defines a collation with the same name, it will shadow the built-in collation. Previously it was possible to create an ICU collation (which was assigned collencoding=-1) that was shadowed by a built-in collation and completely inaccessible. --- src/backend/commands/collationcmds.c | 28 +-- .../regress/expected/collate.icu.utf8.out | 2 +- 2 files changed, 21 insertions(+), 9 deletions(-) diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c index c91fe66d9b..a53700256b 100644 --- a/src/backend/commands/collationcmds.c +++ b/src/backend/commands/collationcmds.c @@ -302,16 +302,29 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), e
Re: Order changes in PG16 since ICU introduction
On 09.05.23 17:09, Jeff Davis wrote: It's awkward for a user to read pg_database.datlocprovider, then depending on that, either look in datcollate or daticulocale. (It's awkward in the code, too.) Maybe some built-in function that returns a tuple of the default provider, the locale, and the version? Or should we also output the ctype somehow (which affects the results of upper()/lower())? There is also the deterministic flag and the icurules setting. Depending on what level of detail you imagine the user needs, you really do need to look at the whole picture, not some subset of it.
Re: Order changes in PG16 since ICU introduction
On 09.05.23 10:25, Alvaro Herrera wrote: On 2023-Apr-24, Peter Eisentraut wrote: The GUC settings lc_collate and lc_ctype are from a time when those locale settings were cluster-global. When we made those locale settings per-database (PG 8.4), we kept them as read-only. As of PG 15, you can use ICU as the per-database locale provider, so what is being attempted in the above example is already meaningless before PG 16, since you need to look into pg_database to find out what is really happening. I think we should just remove the GUC parameters lc_collate and lc_ctype. I agree with removing these in v16, since they are going to become more meaningless and confusing. Here is my proposed patch for this. From b548a671ad02a5c851a4984db6e4535a0b70f881 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Thu, 11 May 2023 13:02:02 +0200 Subject: [PATCH] Remove read-only server settings lc_collate and lc_ctype The GUC settings lc_collate and lc_ctype are from a time when those locale settings were cluster-global. When those locale settings were made per-database (PG 8.4), the settings were kept as read-only. As of PG 15, you can use ICU as the per-database locale provider, so examining these settings is already meaningless, since you need to look into pg_database to find out what is really happening. Discussion: https://www.postgresql.org/message-id/696054d1-bc88-b6ab-129a-18b8bce6a...@enterprisedb.com --- contrib/citext/expected/citext_utf8.out | 4 +-- contrib/citext/expected/citext_utf8_1.out | 4 +-- contrib/citext/sql/citext_utf8.sql| 4 +-- doc/src/sgml/config.sgml | 32 --- src/backend/utils/init/postinit.c | 4 --- src/backend/utils/misc/guc_tables.c | 26 --- .../regress/expected/collate.icu.utf8.out | 4 +-- .../regress/expected/collate.linux.utf8.out | 6 ++-- .../expected/collate.windows.win1252.out | 6 ++-- src/test/regress/sql/collate.icu.utf8.sql | 4 +-- src/test/regress/sql/collate.linux.utf8.sql | 6 ++-- .../regress/sql/collate.windows.win1252.sql | 6 ++-- 12 files changed, 22 insertions(+), 84 deletions(-) diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out index 77b4586d8f..6630e09a4d 100644 --- a/contrib/citext/expected/citext_utf8.out +++ b/contrib/citext/expected/citext_utf8.out @@ -8,8 +8,8 @@ * to the "tr-TR-x-icu" collation where it will succeed. */ SELECT getdatabaseencoding() <> 'UTF8' OR - current_setting('lc_ctype') = 'C' OR - (SELECT datlocprovider='i' FROM pg_database + (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i' +FROM pg_database WHERE datname=current_database()) AS skip_test \gset \if :skip_test diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out index d1e1fe1a9d..3caa7a00d4 100644 --- a/contrib/citext/expected/citext_utf8_1.out +++ b/contrib/citext/expected/citext_utf8_1.out @@ -8,8 +8,8 @@ * to the "tr-TR-x-icu" collation where it will succeed. */ SELECT getdatabaseencoding() <> 'UTF8' OR - current_setting('lc_ctype') = 'C' OR - (SELECT datlocprovider='i' FROM pg_database + (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i' +FROM pg_database WHERE datname=current_database()) AS skip_test \gset \if :skip_test diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql index 8530c68dd7..1f51df134b 100644 --- a/contrib/citext/sql/citext_utf8.sql +++ b/contrib/citext/sql/citext_utf8.sql @@ -9,8 +9,8 @@ */ SELECT getdatabaseencoding() <> 'UTF8' OR - current_setting('lc_ctype') = 'C' OR - (SELECT datlocprovider='i' FROM pg_database + (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i' +FROM pg_database WHERE datname=current_database()) AS skip_test \gset \if :skip_test diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 909a3f28c7..3e9030e3d7 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -10788,38 +10788,6 @@ Preset Options - - lc_collate (string) - - lc_collate configuration parameter - - - - -Reports the locale in which sorting of textual data is done. -See for more information. -This value is determined when a database is created. - - - - - - lc_ctype (string) - - lc_ctype configuration parameter - - - - -Reports the locale that determines character classifications. -See for more information. -This value is determined when a database is created. -Ordinarily this will be the same as lc_collate, -but for special applications it might be set differently. - - -
Re: Order changes in PG16 since ICU introduction
On Tue, 2023-05-09 at 10:25 +0200, Alvaro Herrera wrote: > I agree with removing these in v16, since they are going to become > more > meaningless and confusing. Agreed, but it would be nice to have an alternative that does the right thing. It's awkward for a user to read pg_database.datlocprovider, then depending on that, either look in datcollate or daticulocale. (It's awkward in the code, too.) Maybe some built-in function that returns a tuple of the default provider, the locale, and the version? Or should we also output the ctype somehow (which affects the results of upper()/lower())? Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 2023-Apr-24, Peter Eisentraut wrote: > The GUC settings lc_collate and lc_ctype are from a time when those locale > settings were cluster-global. When we made those locale settings > per-database (PG 8.4), we kept them as read-only. As of PG 15, you can use > ICU as the per-database locale provider, so what is being attempted in the > above example is already meaningless before PG 16, since you need to look > into pg_database to find out what is really happening. > > I think we should just remove the GUC parameters lc_collate and lc_ctype. I agree with removing these in v16, since they are going to become more meaningless and confusing. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Order changes in PG16 since ICU introduction
On Mon, 2023-05-08 at 17:47 -0400, Tom Lane wrote: > -ERROR: could not convert locale name "C" to language tag: > U_ILLEGAL_ARGUMENT_ERROR > +NOTICE: using standard form "en-US-u-va-posix" for locale "C" ... > I suppose this is environment-dependent. Sadly, the buildfarm > client does not show the prevailing LANG or LC_XXX settings. Looks like it's failing-to-fail on some versions of ICU which automatically perform that conversion. The easiest thing to do is revert it for now, and after we sort out the memcmp() path for the ICU provider, then I can commit it again (after that point it would just be code cleanup and should have no functional impact). Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
Jeff Davis writes: > === 0001: do not convert C to en-US-u-va-posix > I plan to commit this soon. Several buildfarm animals have failed since this went in. The only one showing enough info to diagnose is siskin [1]: @@ -1043,16 +1043,15 @@ ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense" HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. CREATE COLLATION testx (provider = icu, locale = 'C'); -- fails -ERROR: could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR +NOTICE: using standard form "en-US-u-va-posix" for locale "C" CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR SET icu_validation_level = WARNING; CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx; WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR +ERROR: collation "testx" already exists CREATE COLLATION testx (provider = icu, locale = 'C'); DROP COLLATION testx; -WARNING: could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR -WARNING: ICU locale "C" has unknown language "c" -HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. +NOTICE: using standard form "en-US-u-va-posix" for locale "C" CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx; WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense" HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. I suppose this is environment-dependent. Sadly, the buildfarm client does not show the prevailing LANG or LC_XXX settings. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=siskin&dt=2023-05-08%2020%3A09%3A26
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote: > On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis wrote: > > Most of the complaints seem to be complaints about v15 as well, and > > while those complaints may be a reason to not make ICU the default, > > they are also an argument that we should continue to learn and try > > to > > fix those issues because they exist in an already-released version. > > Leaving it the default for now will help us fix those issues rather > > than hide them. > > > > It's still early, so we have plenty of time to revert the initdb > > default if we need to. > > That's fair enough, but I really think it's important that some > energy > get invested in providing adequate documentation for this stuff. Just > patching the code is not enough. Attached a significant documentation patch. I tried to make it comprehensive without trying to be exhaustive, and I separated the explanation of language tags from what collation settings you can include in a language tag, so hopefully that's more clear. I added quite a few examples spread throughout the various sections, and I preserved the existing examples at the end. I also left all of the external links at the bottom for those interested enough to go beyond what's there. I didn't add additional documentation for ICU rules. There are so many options for collations that it's hard for me to think of realistic examples to specify the rules directly, unless someone wants to invent a new language. Perhaps useful if working with an interesting text file format with special treatment for delimiters? I asked the question about rules here: https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at and got some limited response about addressing sort complaints. That sounds reasonable, but a lot of that can also be handled just by specifying the right collation settings. Someone who understands the use case better could add some more documentation. -- Jeff Davis PostgreSQL Contributor Team - AWS From b09515bfaf5e9de330138ec4a627d02a7947de1a Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 27 Apr 2023 14:43:46 -0700 Subject: [PATCH v1] Doc improvements for language tags and custom ICU collations. Separate the documentation for language tags from the documentaiton for the available collation settings which can be included in a language tag. Include tables of the available options, more details about the effects of each option, and additional examples. Also include an explanation of the "levels" of textual features and how they relate to collation. --- doc/src/sgml/charset.sgml | 656 +++--- 1 file changed, 535 insertions(+), 121 deletions(-) diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 6dd95b8966..be74064168 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -377,7 +377,125 @@ initdb --locale-provider=icu --icu-locale=en variants and customization options. + + ICU Locales + +ICU Locale Names + + The ICU format for the locale name is a Language Tag. + + +CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP); +CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr'); + + + + +Locale Canonicalization and Validation + + When defining a new ICU collation object or database with ICU as the + provider, the given locale name is transformed ("canonicalized") into a + language tag if not already in that form. For instance, + + +CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true'); +NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true" +CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8'); +NOTICE: using standard form "de-DE" for locale "de_DE.utf8" + + + If you see such a message, ensure that the PROVIDER and + LOCALE are as you expect, and consider specifying + directly as the canonical language tag instead of relying on the + transformation. + + + + ICU can transform most libc locale names, as well as some other formats, + into language tags for easier transition to ICU. If a libc locale name + is used in ICU, it may not have precisely the same behavior as in libc. + + + + If there is some problem interpreting the locale name, or if it represents + a language or region that ICU does not recognize, a message will be reported: + +SET icu_validation_level = ERROR; +CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense'); +ERROR: ICU locale "nonsense" has unknown language "nonsense" +HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. + + + controls how the message is + reported. If set below ERROR, the collation will still + be created, but the behavior may not be what the user intended. + + + +Language Tag + + Basic language tags are simply
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-28 at 14:35 -0700, Jeff Davis wrote: > On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote: > > This should be pg_strcasecmp(...) == 0 > > Good catch, thank you! Fixed in updated patches. Rebased patches. === 0001: do not convert C to en-US-u-va-posix I plan to commit this soon. If someone specifies "C", they are probably expecting memcmp()-like behavior, or some kind of error/warning that it can't be provided. Removing this transformation means that if you specify iculocale=C, you'll get an error or warning (depending on icu_validation_level), because C is not a recognized icu locale. Depending on how some of the other issues in this thread are sorted out, we may want to relax the validation. === 0002: fix @euro, etc. in ICU >= 64 I'd like to commit this soon too, but I'll wait for someone to take a look. It makes it more reliable to map libc names to icu locale names regardless of the ICU version. It doesn't solve the problem for locales like "de__PHONEBOOK", but those don't seem to be a libc format (I think just an old ICU format), so I don't see a big reason to carry it forward. It might be another reason to turn down the validation level to WARNING, though. === 0003: support C memcmp() behavior with ICU provider The current patch 0003 has a problem, because in previous postgres versions (going all the way back), we allowed "C" as a valid ICU locale, that would actually be passed to ICU as a locale name. But ICU didn't recognize it, and it would end up opening the root locale. So we can't simply redefine "C" to mean "memcmp", because that would potentially break indexes. I see the following potential solutions: 1. Represent the memcmp behavior with iculocale=NULL, or some other catalog hack, so that we can distinguish between a locale "C" upgraded from a previous version (which should pass "C" to ICU and get the root locale), and a new collation defined with locale "C" (which should have memcmp behavior). The catalog representation for locale information is already complex, so I'm not excited about this option, but it will work. 2. When provider=icu and locale=C, magically transform that into provider=libc to get memcmp-like behavior for new collations but preserve the existing behavior for upgraded collations. Not especially clean, but if we issue a NOTICE perhaps that would avoid confusion. 3. Like #2, except create a new provider type "none" which may be slightly less confusing. === 0004: make LOCALE apply to ICU for CREATE DATABASE To understand this patch it helps to understand the confusing situation with CREATE DATABASE in version 15: The keywords LC_CTYPE and LC_COLLATE set the server environment LC_CTYPE/LC_COLLATE for that database and can be specified regardless of the provider. LOCALE can be specified along with (or instead of) LC_CTYPE and LC_COLLATE, in which case whichever of LC_CTYPE or LC_COLLATE is unspecified defaults to the setting of LOCALE. Iff the provider is libc, LC_CTYPE and LC_COLLATE also act as the database default collation's locale. If the provider is icu, then none of LOCALE, LC_CTYPE, or LC_COLLATE affect the database default collation's locale at all; that's controlled by ICU_LOCALE (which may be omitted if the template's daticulocale is non-NULL). The idea of patch 0004 is to address the last part, which is probably the most confusing aspect. But for that to work smoothly, we need something like 0003 so that LOCALE=C gives the same semantics regardless of the provider. Regards, Jeff Davis From ddda683963959a175dff17ab0e3d8519641498b9 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Fri, 21 Apr 2023 14:03:57 -0700 Subject: [PATCH v4 1/4] ICU: do not convert locale 'C' to 'en-US-u-va-posix'. The conversion was intended to be for convenience, but it's more likely to be confusing than useful. The user can still directly specify 'en-US-u-va-posix' if desired. Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com --- src/backend/utils/adt/pg_locale.c | 19 +-- src/bin/initdb/initdb.c | 17 + .../regress/expected/collate.icu.utf8.out | 8 src/test/regress/sql/collate.icu.utf8.sql | 4 4 files changed, 14 insertions(+), 34 deletions(-) diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index f0b6567da1..51b4221a39 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel) { #ifdef USE_ICU UErrorCode status; - char lang[ULOC_LANG_CAPACITY]; char *langtag; size_t buflen = 32; /* arbitrary starting buffer size */ const bool strict = true; - status = U_ZERO_ERROR; - uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) - { - if (elevel > 0) - ereport(elevel, - (errmsg("could not get language from locale \"
Re: Order changes in PG16 since ICU introduction
On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote: > This should be pg_strcasecmp(...) == 0 Good catch, thank you! Fixed in updated patches. > postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9 > template > 'template0'; > ERROR: could not convert locale name "fr_FR@euro" to language tag: > U_ILLEGAL_ARGUMENT_ERROR ICU 63 and earlier convert it without error to the language tag 'fr-FR- u-cu-eur', which is correct. ICU 64 removed support for transforming some locale variants, because apparently they think those variants are obsolete: https://unicode-org.atlassian.net/browse/ICU-22268 https://unicode-org.atlassian.net/browse/ICU-20187 (Aside: how obsolete are those variants?) It's frustrating that they'd remove such transformations from the canonicalization process. Fortunately, it looks like it's easy enough to do the transformation ourselves. The only problematic format is '...@VARIANT'. The other format 'fr_FR_EURO' doesn't seem to be a valid glibc locale name[1] and windows seems to use BCP 47[2]. And there don't seem to be a lot of variants to handle. ICU 63 only handles 3 variants, so that's what my patch does. Any unknown variant between 5 and 8 characters won't throw an error. There could be more problem cases, but I'm not sure how much of a practical problem they are. If we try to keep the meaning of LOCALE to only LC_COLLATE and LC_CTYPE, that will continue to be confusing for anyone that uses provider=icu. Regards, Jeff Davis [1] https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html [2] https://learn.microsoft.com/en-us/windows/win32/intl/locale-names From 6c0251c584edea64148604da52c8e55e43fe36e6 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Fri, 21 Apr 2023 14:03:57 -0700 Subject: [PATCH v3 1/4] ICU: do not convert locale 'C' to 'en-US-u-va-posix'. The conversion was intended to be for convenience, but it's more likely to be confusing than useful. The user can still directly specify 'en-US-u-va-posix' if desired. Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com --- src/backend/utils/adt/pg_locale.c | 19 +-- src/bin/initdb/initdb.c | 17 + .../regress/expected/collate.icu.utf8.out | 8 src/test/regress/sql/collate.icu.utf8.sql | 4 4 files changed, 14 insertions(+), 34 deletions(-) diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index 51df570ce9..58c4c426bc 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel) { #ifdef USE_ICU UErrorCode status; - char lang[ULOC_LANG_CAPACITY]; char *langtag; size_t buflen = 32; /* arbitrary starting buffer size */ const bool strict = true; - status = U_ZERO_ERROR; - uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) - { - if (elevel > 0) - ereport(elevel, - (errmsg("could not get language from locale \"%s\": %s", - loc_str, u_errorName(status; - return NULL; - } - - /* C/POSIX locales aren't handled by uloc_getLanguageTag() */ - if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) - return pstrdup("en-US-u-va-posix"); - /* * A BCP47 language tag doesn't have a clearly-defined upper limit * (cf. RFC5646 section 4.4). Additionally, in older ICU versions, @@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str) /* check for special language name */ if (strcmp(lang, "") == 0 || - strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 || - strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) + strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0) found = true; /* search for matching language within ICU */ diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 2c208ead01..4086834458 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str) { #ifdef USE_ICU UErrorCode status; - char lang[ULOC_LANG_CAPACITY]; char *langtag; size_t buflen = 32; /* arbitrary starting buffer size */ const bool strict = true; - status = U_ZERO_ERROR; - uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) - { - pg_fatal("could not get language from locale \"%s\": %s", - loc_str, u_errorName(status)); - return NULL; - } - - /* C/POSIX locales aren't handled by uloc_getLanguageTag() */ - if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) - return pstrdup("en-US-u-va-posix"); - /* * A BCP47 language tag doesn't have a clearly-defined upper limit * (cf. RFC5646 section 4.4). Additionally, in older ICU versions, @@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str) /* check for special language name */ if (strcmp(lang, "") == 0 || - strcmp(lang, "root") == 0 || strcmp(lang, "und") ==
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > Attached are a few small patches: > > 0001: don't convert C to en-US-u-va-posix > 0002: handle locale C the same regardless of the provider, as you > suggest above > 0003: make LOCALE (or --locale) apply to everything including ICU Testing this briefly I noticed two regressions 1) all pg_collation.collversion are empty due to a trivial bug in 0002: @ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const char *collcollate) { char *collversion = NULL; + if (pg_strcasecmp("C", collcollate) || + pg_strcasecmp("POSIX", collcollate)) + return NULL; + This should be pg_strcasecmp(...) == 0 2) The following works with HEAD (default provider=icu) but errors out with the patches: postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9 template 'template0'; ERROR: could not convert locale name "fr_FR@euro" to language tag: U_ILLEGAL_ARGUMENT_ERROR fr_FR@euro is a libc locale name $ locale -a|grep fr_FR fr_FR fr_FR@euro fr_FR.iso88591 fr_FR.iso885915@euro fr_FR.utf8 I understand that fr_FR@euro is taken as an ICU locale name, with the idea that the locale syntax being more or less compatible between both providers, this should work smoothly. 0003 seems to go further in the interpretation and fail on it. TBH the assumption that it's OK to feed libc locale names to ICU feels quite uncomfortable. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 22:35 +0100, Andrew Gierth wrote: > > > > > > Can lc_collate_is_c() be taught to check whether an ICU locale is > using > POSIX collation? Attached are a few small patches: 0001: don't convert C to en-US-u-va-posix 0002: handle locale C the same regardless of the provider, as you suggest above 0003: make LOCALE (or --locale) apply to everything including ICU As far as I can tell, any libc locale has a reasonable match in ICU, so setting LOCALE to either C or a libc locale name should be fine. Some locales are only valid in ICU, e.g. '@colStrength=primary', or a language tag representation, so if you do something like: create database foo locale 'en_US@colStrenghth=primary' template template0; You'll get a decent error like: ERROR: invalid LC_COLLATE locale name: "en_US@colStrenghth=primary" HINT: If the locale name is specific to ICU, use ICU_LOCALE. Overall, I think it works out nicely. Let me know if there are still some confusing cases. I tried a few variations and this one seemed the best, but I may have missed something. Regards, Jeff Davis From c768e040dc92b033e4eb0e69f08b59d8d1ffe1e4 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Fri, 21 Apr 2023 14:03:57 -0700 Subject: [PATCH v2 1/3] ICU: do not convert locale 'C' to 'en-US-u-va-posix'. The conversion was intended to be for convenience, but it's more likely to be confusing than useful. The user can still directly specify 'en-US-u-va-posix' if desired. Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com --- src/backend/utils/adt/pg_locale.c | 19 +-- src/bin/initdb/initdb.c | 17 + .../regress/expected/collate.icu.utf8.out | 8 src/test/regress/sql/collate.icu.utf8.sql | 4 4 files changed, 14 insertions(+), 34 deletions(-) diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index 51df570ce9..58c4c426bc 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel) { #ifdef USE_ICU UErrorCode status; - char lang[ULOC_LANG_CAPACITY]; char *langtag; size_t buflen = 32; /* arbitrary starting buffer size */ const bool strict = true; - status = U_ZERO_ERROR; - uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) - { - if (elevel > 0) - ereport(elevel, - (errmsg("could not get language from locale \"%s\": %s", - loc_str, u_errorName(status; - return NULL; - } - - /* C/POSIX locales aren't handled by uloc_getLanguageTag() */ - if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) - return pstrdup("en-US-u-va-posix"); - /* * A BCP47 language tag doesn't have a clearly-defined upper limit * (cf. RFC5646 section 4.4). Additionally, in older ICU versions, @@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str) /* check for special language name */ if (strcmp(lang, "") == 0 || - strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 || - strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) + strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0) found = true; /* search for matching language within ICU */ diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 2c208ead01..4086834458 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str) { #ifdef USE_ICU UErrorCode status; - char lang[ULOC_LANG_CAPACITY]; char *langtag; size_t buflen = 32; /* arbitrary starting buffer size */ const bool strict = true; - status = U_ZERO_ERROR; - uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) - { - pg_fatal("could not get language from locale \"%s\": %s", - loc_str, u_errorName(status)); - return NULL; - } - - /* C/POSIX locales aren't handled by uloc_getLanguageTag() */ - if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) - return pstrdup("en-US-u-va-posix"); - /* * A BCP47 language tag doesn't have a clearly-defined upper limit * (cf. RFC5646 section 4.4). Additionally, in older ICU versions, @@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str) /* check for special language name */ if (strcmp(lang, "") == 0 || - strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 || - strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) + strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0) found = true; /* search for matching language within ICU */ diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out index b5a221b030..99f12d2e73 100644 --- a/src/test/regress/expected/collate.icu.utf8.out +++ b/src/test/regress/expected/collate.icu.utf8.out @@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role; CREATE SCHEMA test_schema; --
Re: Order changes in PG16 since ICU introduction
"Daniel Verite" writes: > FTR the full text search parser still uses the libc functions > is[w]space/alpha/digit... that depend on lc_ctype, whether the db > collation provider is ICU or not. Yeah, those aren't even connected up to the collation-selection mechanisms; lots of work to do there. I wonder if they could be made to use regc_pg_locale.c instead of duplicating logic. regards, tom lane
Re: Order changes in PG16 since ICU introduction
Jeff Davis wrote: > > (I'm not sure whether those operations can get redirected to ICU > > today > > or whether they still always go to libc, but we'll surely want to fix > > it eventually if the latter is still true.) > > Those operations do get redirected to ICU today. FTR the full text search parser still uses the libc functions is[w]space/alpha/digit... that depend on lc_ctype, whether the db collation provider is ICU or not. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote: > I think I might like this idea, except for one thing: you're > imagining > that the locale doesn't control anything except string comparisons. > What about to_upper/to_lower, character classifications in regexes, > etc? If provider='libc' and LC_CTYPE='C', str_toupper/str_tolower are handled with asc_tolower/asc_toupper. The regex character classification is done with pg_char_properties. In these cases neither ICU nor libc is used; it's just code in postgres. libc is special in that you can set LC_COLLATE and LC_CTYPE separately, so that different locales are used for sorting and character classification. That's potentially useful to set LC_COLLATE to C for performance reasons, while setting LC_CTYPE to a useful locale. We don't allow ICU to set collation and ctype separately (it would be possible to allow it, but I don't think there's a huge demand and it's arguably inconsistent to set them differently). > (I'm not sure whether those operations can get redirected to ICU > today > or whether they still always go to libc, but we'll surely want to fix > it eventually if the latter is still true.) Those operations do get redirected to ICU today. There are extensions that call locale-sensitive libc functions directly, and obviously those won't use ICU. > Aside from the user-surprise issues discussed up to now, pg_dump > scripts > emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER > clauses in CREATE DATABASE, and people are going to be very unhappy > if that means they suddenly get totally different locale semantics > after restoring into a new DB. Agreed. > I think we need some plan for mapping > libc-style locale specs into ICU locales so that we can make that > more nearly transparent. ICU does a reasonable job mapping libc-like locale names to ICU locales, e.g. en_US to en-US, etc. The ordering semantics aren't guaranteed to be the same, of course (because the libc-locales are platform-dependent), but it's at least conceptually the same locale. > > Maybe this means we are not ready to do ICU-by-default in v16. > It certainly feels like there might be more here than we want to > start designing post-feature-freeze. This thread is already on the Open Items list. As long as it's not too disruptive to others I'll leave it as-is for now to see how this sorts out. Right now it's not clear to me how much of this is a v15 issue vs a v16 issue. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On 22.04.23 01:00, Jeff Davis wrote: On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote: And the fact that "C" or "POSIX" gets transformed into "en-US-u-va-posix" I already expressed, on reflection, that we should probably just not do that. So I think we're in agreement on this point; patch attached. This makes sense to me. This way, if someone specifies 'C' locale together with ICU provider they get an error. They can then choose to use the libc provider, to get the performance path, or stick with ICU by using the native spelling of the locale.
Re: Order changes in PG16 since ICU introduction
On 21.04.23 19:14, Peter Eisentraut wrote: On 21.04.23 19:09, Sandro Santilli wrote: On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote: "Regina Obe" writes: https://trac.osgeo.org/postgis/ticket/5375 If they actually are using locale C, I would say this is a bug. That should designate memcmp sorting and nothing else. Sounds like a bug to me. This is happening with a PostgreSQL cluster created and served by a build of commit c04c6c5d6f : =# select version(); PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit =# show lc_collate; C =# select '+' < '-'; f If the database is created with locale provider ICU, then lc_collate does not apply here, so the result might be correct (depending on what locale you have set). The GUC settings lc_collate and lc_ctype are from a time when those locale settings were cluster-global. When we made those locale settings per-database (PG 8.4), we kept them as read-only. As of PG 15, you can use ICU as the per-database locale provider, so what is being attempted in the above example is already meaningless before PG 16, since you need to look into pg_database to find out what is really happening. I think we should just remove the GUC parameters lc_collate and lc_ctype.
Re: Order changes in PG16 since ICU introduction
On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis wrote: > Most of the complaints seem to be complaints about v15 as well, and > while those complaints may be a reason to not make ICU the default, > they are also an argument that we should continue to learn and try to > fix those issues because they exist in an already-released version. > Leaving it the default for now will help us fix those issues rather > than hide them. > > It's still early, so we have plenty of time to revert the initdb > default if we need to. That's fair enough, but I really think it's important that some energy get invested in providing adequate documentation for this stuff. Just patching the code is not enough. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote: > And the fact that "C" or "POSIX" gets transformed into > "en-US-u-va-posix" I already expressed, on reflection, that we should probably just not do that. So I think we're in agreement on this point; patch attached. Regards, Jeff Davis From 3d2791af0a236cbc7ce7f29d988e8ac7fd3fd389 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Fri, 21 Apr 2023 14:03:57 -0700 Subject: [PATCH] ICU: do not convert locale 'C' to 'en-US-u-va-posix'. The conversion was intended to be for convenience, but it's more likely to be confusing than useful. The user can still directly specify 'en-US-u-va-posix' if desired. Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com --- src/backend/utils/adt/pg_locale.c | 19 +-- src/bin/initdb/initdb.c | 17 + .../regress/expected/collate.icu.utf8.out | 8 src/test/regress/sql/collate.icu.utf8.sql | 4 4 files changed, 14 insertions(+), 34 deletions(-) diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index 51df570ce9..58c4c426bc 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel) { #ifdef USE_ICU UErrorCode status; - char lang[ULOC_LANG_CAPACITY]; char *langtag; size_t buflen = 32; /* arbitrary starting buffer size */ const bool strict = true; - status = U_ZERO_ERROR; - uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) - { - if (elevel > 0) - ereport(elevel, - (errmsg("could not get language from locale \"%s\": %s", - loc_str, u_errorName(status; - return NULL; - } - - /* C/POSIX locales aren't handled by uloc_getLanguageTag() */ - if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) - return pstrdup("en-US-u-va-posix"); - /* * A BCP47 language tag doesn't have a clearly-defined upper limit * (cf. RFC5646 section 4.4). Additionally, in older ICU versions, @@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str) /* check for special language name */ if (strcmp(lang, "") == 0 || - strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 || - strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) + strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0) found = true; /* search for matching language within ICU */ diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 2c208ead01..4086834458 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str) { #ifdef USE_ICU UErrorCode status; - char lang[ULOC_LANG_CAPACITY]; char *langtag; size_t buflen = 32; /* arbitrary starting buffer size */ const bool strict = true; - status = U_ZERO_ERROR; - uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status); - if (U_FAILURE(status)) - { - pg_fatal("could not get language from locale \"%s\": %s", - loc_str, u_errorName(status)); - return NULL; - } - - /* C/POSIX locales aren't handled by uloc_getLanguageTag() */ - if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) - return pstrdup("en-US-u-va-posix"); - /* * A BCP47 language tag doesn't have a clearly-defined upper limit * (cf. RFC5646 section 4.4). Additionally, in older ICU versions, @@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str) /* check for special language name */ if (strcmp(lang, "") == 0 || - strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 || - strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0) + strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0) found = true; /* search for matching language within ICU */ diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out index b5a221b030..99f12d2e73 100644 --- a/src/test/regress/expected/collate.icu.utf8.out +++ b/src/test/regress/expected/collate.icu.utf8.out @@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role; CREATE SCHEMA test_schema; -- We need to do this this way to cope with varying names for encodings: SET client_min_messages TO WARNING; +SET icu_validation_level = disabled; do $$ BEGIN EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' || @@ -1034,17 +1035,24 @@ BEGIN quote_literal(current_setting('lc_collate')) || ');'; END $$; +RESET icu_validation_level; RESET client_min_messages; CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale" ERROR: parameter "locale" must be specified CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense" HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. +CREATE COLLATION testx (provider = icu, locale = 'c'); --
RE: Order changes in PG16 since ICU introduction
> > My opinion is that the switch to using ICU by default is ill-advised > > and should be reverted. > > Most of the complaints seem to be complaints about v15 as well, and while > those complaints may be a reason to not make ICU the default, they are also > an argument that we should continue to learn and try to fix those issues > because they exist in an already-released version. > Leaving it the default for now will help us fix those issues rather than hide > them. > > It's still early, so we have plenty of time to revert the initdb default if > we need > to. > > Regards, > Jeff Davis I'm fine with that. Sounds like it wouldn't be too hard to just pull it out at the end. Before this, I didn't even know ICU existed in PG15. My first realization that ICU was even a thing was when my PG16 refused to compile without adding my ICU path to my pkg-config or putting in --without-icu. So yah I suspect leaving it in a little bit longer will uncover some more issues and won't harm too much. Thanks, Regina
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote: > My opinion is that the switch to using ICU by default is ill-advised > and should be reverted. Most of the complaints seem to be complaints about v15 as well, and while those complaints may be a reason to not make ICU the default, they are also an argument that we should continue to learn and try to fix those issues because they exist in an already-released version. Leaving it the default for now will help us fix those issues rather than hide them. It's still early, so we have plenty of time to revert the initdb default if we need to. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
> "Jeff" == Jeff Davis writes: >> Is that the right fix, though? (It forces --locale-provider=libc for >> the cluster default, which might not be desirable?) Jeff> For the "no locale" behavior (memcmp()-based) the provider needs Jeff> to be libc. Do you see an alternative? Can lc_collate_is_c() be taught to check whether an ICU locale is using POSIX collation? There's now another bug in that --no-locale no longer does the same thing as --locale=C (which is its long-established documented behavior). How should these various options interact? This all seems not well thought out from a usability perspective, and I think a proper fix should involve a bit more serious consideration. -- Andrew.
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 22:08 +0100, Andrew Gierth wrote: > > > > > > Is that the right fix, though? (It forces --locale-provider=libc for > the > cluster default, which might not be desirable?) For the "no locale" behavior (memcmp()-based) the provider needs to be libc. Do you see an alternative? Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
> "Jeff" == Jeff Davis writes: >> Also, somewhere along the line someone broke initdb --no-locale, >> which should result in C locale being the default everywhere, but >> when I just tested it it picked 'en' for an ICU locale, which is not >> the right thing. Jeff> Fixed, thank you. Is that the right fix, though? (It forces --locale-provider=libc for the cluster default, which might not be desirable?) -- Andrew.
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote: > Maybe this means we are not ready to do ICU-by-default in v16. > It certainly feels like there might be more here than we want to > start designing post-feature-freeze. I don't see how punting to the next release helps. If the CREATE DATABASE syntax (and similar issues for createdb and initdb) in v15 is just too confusing, and we can't find a remedy for v16, then we probably won't find a remedy for v17 either. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote: > > > > > > Also, somewhere along the line someone broke initdb --no-locale, > which > should result in C locale being the default everywhere, but when I > just > tested it it picked 'en' for an ICU locale, which is not the right > thing. Fixed, thank you. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Fri, Apr 21, 2023 at 3:25 PM Jeff Davis wrote: > I am also having second thoughts about accepting "C" or "POSIX" as an > ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not > terribly useful (why not just use memcmp()?), it's not fast in my > measurements (en-US is faster), so maybe it's better to just throw an > error and tell the user to use C (or provider=none as I suggest > above)? I mean, to renew a complaint I've made previously, how the heck is anyone supposed to understand what's going on here? We have no meaningful documentation of how to select an ICU locale that works for you. We have a couple of examples and a suggestion that you should use BCP 47. But when I asked before for documentation references, the ones you provided were not clear, basically incomprehensible. In follow-up discussion, you admitted you'd had to consult the source code to figure certain things out. And the fact that "C" or "POSIX" gets transformed into "en-US-u-va-posix" is also completely documented. That string appears twice in the code, but zero times in the documentation. There's code to do it, but users shouldn't have to read code, and it wouldn't help much if they did, because the code comments don't really explain the rationale behind this choice either. I find the fact that people are having trouble here completely predictable. Of course if people ask for "C" and the system tells them that it's using "en-US-u-va-posix" instead they're going to be confused and ask questions, exactly as is happening here. glibc collations aren't particularly well-documented either, but people have some experience with, and they can get a list of values that have a chance of working from /usr/share/locale, and they know what "C" means. Nobody knows what "en-US-u-va-posix" is. It's not even Googleable, really, whereas "C locale" is. My opinion is that the switch to using ICU by default is ill-advised and should be reverted. The compatibility break isn't worth whatever advantages ICU may have, the documentation to allow people to transition to ICU with reasonable effort doesn't exist, and the fact that within weeks of feature freeze people who know a lot about PostgreSQL are struggling to get the behavior they want is a really bad sign. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 13:28 -0400, Tom Lane wrote: > I am wondering however whether this doesn't mean that all our > carefully > coded fast paths for C locale just went down the drain. The code still exists. You can test it by using the built-in collation "C" which is correctly specified with collprovider=libc and collcollate=C. For my test dataset, ICU 72, glibc 2.35: -- ~07s explain analyze select t from a order by t collate "C"; -- ~15s explain analyze select t from a order by t collate "en-US-x-icu"; -- ~21s explain analyze select t from a order by t collate "en-US-u-va-posix- x-icu"; -- ~34s explain analyze select t from a order by t collate "en_US"; I believe the confusion in this thread comes from: * The syntax of CREATE DATABASE (the same as v15 but still confusing) * The fact that you need provider=libc to get memcmp() behavior (same as v15 but still confusing) Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 21:14 +0200, Sandro Santilli wrote: > And then runs: > > createdb --encoding=UTF-8 --template=template0 --lc-collate=C > > Should we tweak anything else to make the results predictable ? You can specify --locale-provider=libc Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
Jeff Davis writes: > I have a couple ideas: > 1. Introduce a "none" provider to separate the concept of C/POSIX > locales from the libc provider. It's not really using a provider > anyway, it's just using memcmp(), and I think it causes confusion to > combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than > "LOCALE_PROVIDER=libc LOCALE='C'". I think I might like this idea, except for one thing: you're imagining that the locale doesn't control anything except string comparisons. What about to_upper/to_lower, character classifications in regexes, etc? (I'm not sure whether those operations can get redirected to ICU today or whether they still always go to libc, but we'll surely want to fix it eventually if the latter is still true.) In any case, that seems somewhat orthogonal to what we're on about here, which is making the behavior of CREATE DATABASE less surprising and more backwards-compatible. I'm not sure that provider=none can help with that. Aside from the user-surprise issues discussed up to now, pg_dump scripts emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER clauses in CREATE DATABASE, and people are going to be very unhappy if that means they suddenly get totally different locale semantics after restoring into a new DB. I think we need some plan for mapping libc-style locale specs into ICU locales so that we can make that more nearly transparent. > 2. Change the CREATE DATABASE syntax to catch these errors better at > the possible expense of backwards compatibility. That is the exact opposite of what I think we need. Backwards compatibility isn't optional. Maybe this means we are not ready to do ICU-by-default in v16. It certainly feels like there might be more here than we want to start designing post-feature-freeze. regards, tom lane
Re: Order changes in PG16 since ICU introduction
On Fri, 2023-04-21 at 14:23 -0400, Tom Lane wrote: > postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' > LOCALE = 'C'; ... > test1 | postgres | UTF8 | icu | C | > C | en-US | | > (4 rows) > > Looks like the "pick en-US even when told not to" problem exists here > too. Both provider (ICU) and the icu locale (en-US) are inherited from template0. The LOCALE parameter to CREATE DATABASE doesn't affect either of those things, because there's a separate parameter ICU_LOCALE. This happens the same way in v15, and although it matches the documentation technically, it is not a great user experience. I have a couple ideas: 1. Introduce a "none" provider to separate the concept of C/POSIX locales from the libc provider. It's not really using a provider anyway, it's just using memcmp(), and I think it causes confusion to combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than "LOCALE_PROVIDER=libc LOCALE='C'". 2. Change the CREATE DATABASE syntax to catch these errors better at the possible expense of backwards compatibility. I am also having second thoughts about accepting "C" or "POSIX" as an ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not terribly useful (why not just use memcmp()?), it's not fast in my measurements (en-US is faster), so maybe it's better to just throw an error and tell the user to use C (or provider=none as I suggest above)? Obviously the user could manually type "en-US-u-va-posix" if that's the locale they want. Throwing an error would be a backwards-compatibility issue, but in v15 an ICU locale of "C" just gives the root locale anyway, which is probably not what they want. Regards, Jeff Davis
Re: Order changes in PG16 since ICU introduction
On Fri, Apr 21, 2023 at 10:27:49AM -0700, Jeff Davis wrote: > On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote: > > =# select version(); > > PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu > > 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit > > =# show lc_collate; > > C > > =# select '+' < '-'; > > f > > What is the result of: > > select datlocprovider, datcollate, daticulocale > from pg_database where datname=current_database(); datlocprovider | i datcollate | C daticulocale | en-US --strk;
Re: Order changes in PG16 since ICU introduction
> "Tom" == Tom Lane writes: >> Also, somewhere along the line someone broke initdb --no-locale, >> which should result in C locale being the default everywhere, but >> when I just tested it it picked 'en' for an ICU locale, which is not >> the right thing. Tom> Confirmed: Tom> $ LANG=en_US.utf8 initdb --no-locale Tom> The files belonging to this database system will be owned by user "postgres". Tom> This user must also own the server process. Tom> Using default ICU locale "en_US". Tom> Using language tag "en-US" for ICU locale "en_US". Tom> The database cluster will be initialized with this locale configuration: Tom> provider:icu Tom> ICU locale: en-US Tom> LC_COLLATE: C Tom> LC_CTYPE:C Tom> ... Tom> That needs to be fixed: --no-locale should prevent any Tom> consideration of initdb's LANG/LC_foo environment. Would it also not make sense to also take into account any --locale and --lc-* options before choosing an ICU default locale? Right now if you do, say, initdb --locale=fr_FR you get an ICU locale based on the environment but lc_* settings based on the option, which seems maximally confusing. Also, what happens now to lc_collate_is_c() when the provider is ICU? Am I missing something, or is it never true now, even if you specified C / POSIX / en-US-u-va-posix as the ICU locale? This seems like it could be an important pessimization. Also also, we now have the problem that it is much harder to create a 'C' collation database within an existing cluster (e.g. for testing) without knowing whether the default provider is ICU. In the past one would have done: CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C'; but now that creates a database that uses the same ICU locale as template0 by default. If instead one tries: CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' ICU_LOCALE='C'; then one gets an error if the default locale provider is _not_ ICU. The only option now seems to be: CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' LOCALE_PROVIDER = 'libc'; which of course doesn't work in older pg versions. -- Andrew.
Re: Order changes in PG16 since ICU introduction
On Fri, Apr 21, 2023 at 07:14:13PM +0200, Peter Eisentraut wrote: > On 21.04.23 19:09, Sandro Santilli wrote: > > On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote: > > > "Regina Obe" writes: > > > > > > > https://trac.osgeo.org/postgis/ticket/5375 > > > > > > If they actually are using locale C, I would say this is a bug. > > > That should designate memcmp sorting and nothing else. > > > > Sounds like a bug to me. This is happening with a PostgreSQL cluster > > created and served by a build of commit c04c6c5d6f : > > > >=# select version(); > >PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu > > 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit > >=# show lc_collate; > >C > >=# select '+' < '-'; > >f > > If the database is created with locale provider ICU, then lc_collate does > not apply here, so the result might be correct (depending on what locale you > have set). The database is created by a perl script which starts like this: $ENV{"LC_ALL"} = "C"; $ENV{"LANG"} = "C"; And then runs: createdb --encoding=UTF-8 --template=template0 --lc-collate=C Should we tweak anything else to make the results predictable ? --strk;
Re: Order changes in PG16 since ICU introduction
"Regina Obe" writes: > CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C'; > Doesn't seem to work at least not under mingw64 anyway. Hmm, doesn't work for me either: $ LANG=en_US.utf8 initdb The files belonging to this database system will be owned by user "postgres". This user must also own the server process. Using default ICU locale "en_US". Using language tag "en-US" for ICU locale "en_US". The database cluster will be initialized with this locale configuration: provider:icu ICU locale: en-US LC_COLLATE: en_US.utf8 LC_CTYPE:en_US.utf8 LC_MESSAGES: en_US.utf8 LC_MONETARY: en_US.utf8 LC_NUMERIC: en_US.utf8 LC_TIME: en_US.utf8 ... $ psql postgres psql (16devel) Type "help" for help. postgres=# SELECT '+' < '-' ; ?column? -- f (1 row) (as expected, so far) postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C'; CREATE DATABASE postgres=# \c test1 You are now connected to database "test1" as user "postgres". test1=# SELECT '+' < '-' ; ?column? -- f (1 row) (wrong!) test1=# \l List of databases Name| Owner | Encoding | Locale Provider | Collate | Ctype| ICU Locale | ICU Rules | Access privileges ---+--+--+-++++---+--- postgres | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | en-US | | template0 | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | en-US | | =c/postgres + | | | ||| | | postgres=CTc/postgres template1 | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | en-US | | =c/postgres + | | | ||| | | postgres=CTc/postgres test1 | postgres | UTF8 | icu | C | C | en-US | | (4 rows) Looks like the "pick en-US even when told not to" problem exists here too. regards, tom lane
RE: Order changes in PG16 since ICU introduction
> Yeah. My recommendation is just LOCALE: > > regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = > 'UTF8' LOCALE = 'C'; CREATE DATABASE regression=# CREATE DATABASE test2 > TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C'; > NOTICE: using standard form "en-US-u-va-posix" for locale "C" > CREATE DATABASE > > I think it's probably intentional that ICU_LOCALE is stricter about being given > a real ICU locale name, but I didn't write any of that code. > > regards, tom lane CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C'; Doesn't seem to work at least not under mingw64 anyway. SELECT '+' < '-' ; Returns false
Re: Order changes in PG16 since ICU introduction
Andrew Gierth writes: > "Peter" == Peter Eisentraut writes: > Peter> If the database is created with locale provider ICU, then > Peter> lc_collate does not apply here, > Having lc_collate return a value which is silently being ignored seems > to me rather hugely confusing. It's not *completely* ignored --- there are bits of code that are not yet ICU-ified and will still use the libc facilities. So we can't get rid of those options yet, even in an ICU-based database. > Also, somewhere along the line someone broke initdb --no-locale, which > should result in C locale being the default everywhere, but when I just > tested it it picked 'en' for an ICU locale, which is not the right > thing. Confirmed: $ LANG=en_US.utf8 initdb --no-locale The files belonging to this database system will be owned by user "postgres". This user must also own the server process. Using default ICU locale "en_US". Using language tag "en-US" for ICU locale "en_US". The database cluster will be initialized with this locale configuration: provider:icu ICU locale: en-US LC_COLLATE: C LC_CTYPE:C ... That needs to be fixed: --no-locale should prevent any consideration of initdb's LANG/LC_foo environment. regards, tom lane
Re: Order changes in PG16 since ICU introduction
> "Peter" == Peter Eisentraut writes: Peter> If the database is created with locale provider ICU, then Peter> lc_collate does not apply here, Having lc_collate return a value which is silently being ignored seems to me rather hugely confusing. Also, somewhere along the line someone broke initdb --no-locale, which should result in C locale being the default everywhere, but when I just tested it it picked 'en' for an ICU locale, which is not the right thing. -- Andrew (irc:RhodiumToad)
Re: Order changes in PG16 since ICU introduction
"Regina Obe" writes: > Okay got it was on IRC with RhodiumToad and he suggested: > CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C' > LC_CTYPE = 'C' ICU_LOCALE='C'; > Which gives expected result: > SELECT '+' < '-' ; -- true > but gives me a notice: > NOTICE: using standard form "en-US-u-va-posix" for locale "C" Yeah. My recommendation is just LOCALE: regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C'; CREATE DATABASE regression=# CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C'; NOTICE: using standard form "en-US-u-va-posix" for locale "C" CREATE DATABASE I think it's probably intentional that ICU_LOCALE is stricter about being given a real ICU locale name, but I didn't write any of that code. regards, tom lane
RE: Order changes in PG16 since ICU introduction
> > CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' > LC_COLLATE = 'C' > > LC_CTYPE = 'C'; > > As has been pointed out already, setting LC_COLLATE/LC_CTYPE is > meaningless when the locale provider is ICU. You need to look at what ICU > locale is being chosen, or force it with LOCALE = 'C'. > > regards, tom lane Okay got it was on IRC with RhodiumToad and he suggested: CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C' LC_CTYPE = 'C' ICU_LOCALE='C'; Which gives expected result: SELECT '+' < '-' ; -- true but gives me a notice: NOTICE: using standard form "en-US-u-va-posix" for locale "C"