subject:"Order changes in PG16 since ICU introduction"


On 6/6/23 15:55, Tom Lane wrote:

Robert Haas  writes:

On Tue, Jun 6, 2023 at 3:25 PM Tom Lane  wrote:

Also +1, except that I find "none" a rather confusing choice of name.
There *is* a provider, it's just PG itself not either libc or ICU.
I thought Joe's suggestion of "internal" made more sense.



Or perhaps "builtin" or "postgresql".


Either OK by me


Same here

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

2023-06-06 Thread Tom Lane

Robert Haas  writes:
> On Tue, Jun 6, 2023 at 3:25 PM Tom Lane  wrote:
>> Also +1, except that I find "none" a rather confusing choice of name.
>> There *is* a provider, it's just PG itself not either libc or ICU.
>> I thought Joe's suggestion of "internal" made more sense.

> Or perhaps "builtin" or "postgresql".

Either OK by me

regards, tom lane

Re: Order changes in PG16 since ICU introduction

2023-06-06 Thread Robert Haas

On Tue, Jun 6, 2023 at 3:25 PM Tom Lane  wrote:
> Joe Conway  writes:
> > On 6/6/23 15:18, Jeff Davis wrote:
> >> The locale "C" is a special case, documented as a non-locale. So, if
> >> LOCALE/--locale apply to ICU, then either ICU needs to handle locale
> >> "C" in the expected way (v8 patch series); or when we see locale "C" we
> >> need to somehow change the provider into something that can handle it
> >> (v6 patch series changes it to the "none" provider).
>
> > +1 to the latter approach
>
> Also +1, except that I find "none" a rather confusing choice of name.
> There *is* a provider, it's just PG itself not either libc or ICU.
> I thought Joe's suggestion of "internal" made more sense.

Or perhaps "builtin" or "postgresql".

I'm just thinking that "internal" as a type name kind of means "you
shouldn't be touching this from SQL" and we don't want to give people
the idea that the "C" locale isn't something you should use.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Order changes in PG16 since ICU introduction

2023-06-06 Thread Tom Lane

Joe Conway  writes:
> On 6/6/23 15:18, Jeff Davis wrote:
>> The locale "C" is a special case, documented as a non-locale. So, if
>> LOCALE/--locale apply to ICU, then either ICU needs to handle locale
>> "C" in the expected way (v8 patch series); or when we see locale "C" we
>> need to somehow change the provider into something that can handle it
>> (v6 patch series changes it to the "none" provider).

> +1 to the latter approach

Also +1, except that I find "none" a rather confusing choice of name.
There *is* a provider, it's just PG itself not either libc or ICU.
I thought Joe's suggestion of "internal" made more sense.

regards, tom lane

Re: Order changes in PG16 since ICU introduction


On 6/6/23 15:18, Jeff Davis wrote:

On Tue, 2023-06-06 at 15:09 +0200, Daniel Verite wrote:

FWIW I don't quite see how 0001 improve things or what problem it's
trying to solve.


The word "locale" is generic, so we need to make LOCALE/--locale apply
to whatever provider is being used. If "locale" only applies to libc,
using ICU will always be confusing and never be on the same level as
libc, let alone the preferred provider.



Agree 100%


The locale "C" is a special case, documented as a non-locale. So, if
LOCALE/--locale apply to ICU, then either ICU needs to handle locale
"C" in the expected way (v8 patch series); or when we see locale "C" we
need to somehow change the provider into something that can handle it
(v6 patch series changes it to the "none" provider).


+1 to the latter approach


Please let me know if you disagree with the goal or the reasoning here.
If so, please explain where you think we should end up, because the
status quo does not seem great to me.


also +1


0001 creates exceptions throughout the code so that when an ICU
collation has a locale name "C" or "POSIX" then it does not behave
like an ICU collation, even though pg_collation.collprovider='i'
To me it's neither desirable nor necessary that a collation that
has collprovider='i' is diverted to non-ICU semantics.


It's not very principled, but it matches what libc does.


Makes sense to me

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction


On 6/6/23 15:15, Jeff Davis wrote:

On Tue, 2023-06-06 at 14:11 -0400, Joe Conway wrote:

This discussion makes me wonder (though probably too late for the v16
cycle) if we shouldn't treat "C" and "POSIX" locales to be a third 
provider, something like "internal".


That's exactly what I did in v6 of this series: I created a "none"
provider, and when someone specified provider=icu iculocale=C, it would
change the provider to "none":

https://www.postgresql.org/message-id/5f9bf4a0b040428c5db2dc1f23cc3ad96acb5672.camel%40j-davis.com

I'm fine with either approach.


Ha!

Well it seems like I am +1 on that then ;-)

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

2023-06-06 Thread Jeff Davis

On Tue, 2023-06-06 at 14:11 -0400, Joe Conway wrote:
> This discussion makes me wonder (though probably too late for the v16
> cycle) if we shouldn't treat "C" and "POSIX" locales to be a third 
> provider, something like "internal".

That's exactly what I did in v6 of this series: I created a "none"
provider, and when someone specified provider=icu iculocale=C, it would
change the provider to "none":

https://www.postgresql.org/message-id/5f9bf4a0b040428c5db2dc1f23cc3ad96acb5672.camel%40j-davis.com

I'm fine with either approach.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction


On 6/6/23 09:09, Daniel Verite wrote:

Jeff Davis wrote:

New patch series attached. I plan to commit 0001 and 0002 soon, unless
there are objections.

0001 causes the "C" and "POSIX" locales to be treated with
memcmp/pg_ascii semantics in ICU, just like in libc. We also
considered a new "none" provider, but it's more invasive, and we can
always reconsider that in the v17 cycle.



0001 creates exceptions throughout the code so that when an ICU
collation has a locale name "C" or "POSIX" then it does not behave
like an ICU collation, even though pg_collation.collprovider='i'
To me it's neither desirable nor necessary that a collation that
has collprovider='i' is diverted to non-ICU semantics.


This discussion makes me wonder (though probably too late for the v16 
cycle) if we shouldn't treat "C" and "POSIX" locales to be a third 
provider, something like "internal".


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

2023-06-06 Thread Daniel Verite

Jeff Davis wrote:

> New patch series attached. I plan to commit 0001 and 0002 soon, unless
> there are objections.
> 
> 0001 causes the "C" and "POSIX" locales to be treated with
> memcmp/pg_ascii semantics in ICU, just like in libc. We also
> considered a new "none" provider, but it's more invasive, and we can
> always reconsider that in the v17 cycle.

FWIW I don't quite see how 0001 improve things or what problem it's
trying to solve.

0001 creates exceptions throughout the code so that when an ICU
collation has a locale name "C" or "POSIX" then it does not behave
like an ICU collation, even though pg_collation.collprovider='i'
To me it's neither desirable nor necessary that a collation that
has collprovider='i' is diverted to non-ICU semantics.

Also in the current state, this diversion does not apply to initdb.

"initdb --icu-locale=C" with 0001 applied reports this:

   Using language tag "en-US-u-va-posix" for ICU locale "C".
   The database cluster will be initialized with this locale configuration:
 provider:icu
 ICU locale:  en-US-u-va-posix
 LC_COLLATE:  fr_FR.UTF-8
 [...]

and "initdb --locale=C" reports this:

   Using default ICU locale "fr_FR".
   Using language tag "fr-FR" for ICU locale "fr_FR".
   The database cluster will be initialized with this locale configuration:
 provider:icu
 ICU locale:  fr-FR
 LC_COLLATE:  C
 [...]

Could you elaborate a bit more on what 0001 is meant to achieve, from
the point of view of the user?

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

2023-05-26 Thread Daniel Verite

Jeff Davis wrote:

> > #1
> > 
> > postgres=# create database test1 locale='fr_FR.UTF-8';
> > NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
> > ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of
> 
> I don't see a problem here. If you specify LOCALE to CREATE DATABASE,
> you should either be using "TEMPLATE template0", or you should be
> expecting an error if the LOCALE doesn't match exactly.
> 
> What would you like to see happen here?

What's odd is that initdb starting in an fr_FR.UTF-8 environment
found that "fr" was the default ICU locale to use, whereas
"create database" reports that "fr" and "fr_FR.UTF-8" refer to
incompatible locales.

To me initdb is wrong when coming up with the less precise "fr"
instead of "fr-FR".

I suggest the attached patch to call uloc_getDefault() instead of the
current code that somehow leaves out the country/region component.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 31156e863b..09a5c98cc0 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2354,42 +2354,13 @@ icu_validate_locale(const char *loc_str)
 }
 
 /*
- * Determine default ICU locale by opening the default collator and reading
- * its locale.
- *
- * NB: The default collator (opened using NULL) is different from the collator
- * for the root locale (opened with "", "und", or "root"). The former depends
- * on the environment (useful at initdb time) and the latter does not.
+ * Determine the default ICU locale
  */
 static char *
 default_icu_locale(void)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-	const char *valid_locale;
-	char	   *default_locale;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(NULL, &status);
-	if (U_FAILURE(status))
-		pg_fatal("could not open collator for default locale: %s",
- u_errorName(status));
-
-	status = U_ZERO_ERROR;
-	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-		&status);
-	if (U_FAILURE(status))
-	{
-		ucol_close(collator);
-		pg_fatal("could not determine default ICU locale");
-	}
-
-	default_locale = pg_strdup(valid_locale);
-
-	ucol_close(collator);
-
-	return default_locale;
+	return pg_strdup(uloc_getDefault());
 #else
 	pg_fatal("ICU is not supported in this build");
 #endif

Re: Order changes in PG16 since ICU introduction

2023-05-25 Thread Jeff Davis

On Mon, 2023-05-22 at 14:34 +0200, Peter Eisentraut wrote:
> Please put blank lines between
> 
> 
> 
> 
> etc., matching existing style.
> 
> We usually don't capitalize the collation parameters like
> 
> CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP);
> 
> elsewhere in the documentation.
> 
> Table 24.2. ICU Collation Settings should probably be sorted by key,
> or 
> at least by something.
> 
> All tables should referenced in the text, like "Table x.y shows this
> and 
> that."  (Note that a table could float to a different page in some 
> output formats, so just putting it into a section without some 
> introductory text isn't sound.)

Thank you, done.

> Table 24.1. ICU Collation Levels shows punctuation as level 4, which
> is 
> only true in shifted mode, which isn't the default.  The whole
> business 
> of treating variable collation elements is getting a bit lost in this
> description.  The kv option is described as "Classes of characters 
> ignored during comparison at level 3.", which is effectively true but
> not the whole picture.

I organized the documentation around practical examples and available
options, and less around the conceptual model. I think that's a good
start, but you're right that it over-simplifies in a few areas.

Discussing the model would work better along with an explanation of ICU
rules, where you can make better use of those concepts. I feel like
there are some interesting things that can be done with rules, but I
haven't had a chance to really dig in yet.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-24 Thread Joe Conway


On 5/24/23 11:39, Jeff Davis wrote:

On Mon, 2023-05-22 at 22:09 +0200, Daniel Verite wrote:

In practice we're probably getting the "und" ICU locale whereas "fr"
would be appropriate.


This is a good point and illustrates that ICU is not a drop-in
replacement for libc in all cases.

I don't see a solution here that doesn't involve some rough edges,
though. "Locale" is a generic term, and if we continue to insist that
it really means a libc locale, then ICU will never be on an equal
footing with libc, let alone the preferred provider.


Huge +1

IMHO the experience should be unified to the degree possible.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

2023-05-24 Thread Jeff Davis

On Mon, 2023-05-22 at 22:09 +0200, Daniel Verite wrote:
> While I agree that the LOCALE option in CREATE DATABASE is
> counter-intuitive,

I think it's more than that. As Andreww Gierth pointed out:

   $ initdb --locale=fr_FR
   ...
 ICU locale:  en-US
   ...

Is more than just counter-intuitive. I don't think we can ship 16 that
way.

>  I find it questionable that blending ICU
> and libc locales into it helps that much with the user experience.

Thank you for going through some examples here. I agree that it's not
perfect, but we need some path to a reasonable ICU user experience, and
I think we'll have to accept some rough edges to avoid the worst cases,
like above.

> initdb:
> 
>   Using default ICU locale "fr".
>   Using language tag "fr" for ICU locale "fr".
> 

...

> #1
> 
> postgres=# create database test1 locale='fr_FR.UTF-8';
> NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
> ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of

I don't see a problem here. If you specify LOCALE to CREATE DATABASE,
you should either be using "TEMPLATE template0", or you should be
expecting an error if the LOCALE doesn't match exactly.

What would you like to see happen here?

> #2
> 
> postgres=# create database test2 locale='C.UTF-8'
> template='template0';
> NOTICE:  using standard form "en-US-u-va-posix" for ICU locale
> "C.UTF-8"
> CREATE DATABASE
> 
> 
> en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so
> this interpretation is arguably not what a user would expect.

As you pointed out, this is not settled in libc either:

https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f%40manitou-mail.org

We really can't expect a particular order for a particular locale name,
unless we handle it specially like "C" or "POSIX". If we pass it to the
provider, we have to trust the provider to match our conceptual
expectations for that locale (and ideally version it properly).

> I would expect the ICU warning or error (icu_validation_level) to
> kick
> in instead of that transliteration.

Earlier versions of ICU (<= 63) do this transformation automatically,
and I don't see a reason to throw an error if ICU considers it valid.
The language tag en-US-u-va-posix will be stored in the catalog, and
that will be considered valid in later versions of ICU.

Later versions of ICU (>= 64) consider locales with a language name of
"C" to be obsolete and no longer valid. I added code to do the
transformation without error in these later versions, but I think we
have agreement to remove it.

If a user specifies the locale as "C.UTF-8", we can either pass it to
ICU and see whether that version accepts it or not (and if not, throw a
warning/error); or if we decide that "C.UTF-8" really means "C", we can
handle it in the memcmp() path like C and never send it to ICU.

> #3
> 
> $ grep french /etc/locale.alias
> french  fr_FR.ISO-8859-1
> 
> postgres=# create database test3 locale='french' template='template0'
> encoding='LATIN1';
> WARNING:  ICU locale "french" has unknown language "french"
> HINT:  To disable ICU locale validation, set parameter
> icu_validation_level
> to DISABLED.
> CREATE DATABASE
> 
> 
> In practice we're probably getting the "und" ICU locale whereas "fr"
> would
> be appropriate.

This is a good point and illustrates that ICU is not a drop-in
replacement for libc in all cases.

I don't see a solution here that doesn't involve some rough edges,
though. "Locale" is a generic term, and if we continue to insist that
it really means a libc locale, then ICU will never be on an equal
footing with libc, let alone the preferred provider.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-22 Thread Daniel Verite

Jeff Davis wrote:

> If we special case locale=C, but do nothing for locale=fr_FR, then I'm
> not sure we've solved the problem. Andrew Gierth raised the issue here,
> which he called "maximally confusing":
> 
> https://postgr.es/m/874jp9f5jo@news-spur.riddles.org.uk
> 
> That's why I feel that we need to make locale apply to whatever the
> provider is, not just when it happens to be C.

While I agree that the LOCALE option in CREATE DATABASE is
counter-intuitive, I find it questionable that blending ICU
and libc locales into it helps that much with the user experience.

Trying the lastest v6-* patches applied on top of 722541ead1
(before the pgindent run), here are a few examples when I
don't think it goes well.

The OS is Ubuntu 22.04 (glibc 2.35, ICU 70.1)

initdb:

  Using default ICU locale "fr".
  Using language tag "fr" for ICU locale "fr".
  The database cluster will be initialized with this locale configuration:
provider:icu
ICU locale:  fr
LC_COLLATE:  fr_FR.UTF-8
LC_CTYPE:fr_FR.UTF-8
LC_MESSAGES: fr_FR.UTF-8
LC_MONETARY: fr_FR.UTF-8
LC_NUMERIC:  fr_FR.UTF-8
LC_TIME: fr_FR.UTF-8
  The default database encoding has accordingly been set to "UTF8".

#1

postgres=# create database test1 locale='fr_FR.UTF-8';
NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of the
template database (fr)
HINT:  Use the same ICU locale as in the template database, or use template0
as template.

That looks like a fairly generic case that doesn't work seamlessly.

#2

postgres=# create database test2 locale='C.UTF-8' template='template0';
NOTICE:  using standard form "en-US-u-va-posix" for ICU locale "C.UTF-8"
CREATE DATABASE

en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so
this interpretation is arguably not what a user would expect.

I would expect the ICU warning or error (icu_validation_level) to kick
in instead of that transliteration.

#3

$ grep french /etc/locale.alias
french  fr_FR.ISO-8859-1

postgres=# create database test3 locale='french' template='template0'
encoding='LATIN1';
WARNING:  ICU locale "french" has unknown language "french"
HINT:  To disable ICU locale validation, set parameter icu_validation_level
to DISABLED.
CREATE DATABASE

In practice we're probably getting the "und" ICU locale whereas "fr" would
be appropriate.

I assume that we would find more cases like that if testing on many
operating systems.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

2023-05-22 Thread Jeff Davis

On Mon, 2023-05-22 at 14:27 +0200, Peter Eisentraut wrote:
> The rules are for setting whatever sort order you like.  Maybe you
> want 
> to sort + before - or whatever.  It's like, if you don't like it,
> build 
> your own.

A build-your-own feature is fine, but it's not completely zero cost.

There some risk that rules specified for ICU version X fail to load for
ICU version Y. If that happens to your database default collation, you
are in big trouble. The risk of failing to load a language tag in a
later version, especially one returned by uloc_toLanguageTag() in
strict mode, is much lower. We can reduce the risk by allowing rules
only for CREATE COLLATION (not  CREATE DATABASE), and see what users do
with it first, and consider adding it to CREATE DATABASE later.

We can also try to explain in the docs that it's a build-it-yourself
kind of feature (use it if you see a purpose, otherwise ignore it),
though I'm not sure quite how we should word it.

And I'm skeptical that we don't have a single plausible end-to-end user
story. I just can't think of any reason someone would need something
like this, given how flexible the collation settings in the language
tags are. The best case I can think of is if someone is trying to make
an ICU collation that matches some non-ICU collation in another system,
which sounds hard; but perhaps it's reasonable to do in cases where it
just needs to work well-enough in some limited case.

Also, do we have an answer as to why specifying the rules as '' is not
the same as not specifying any rules[1]?

[1]
https://www.postgresql.org/message-id/36a6e89689716c2ca1fae8adc8e84601a041121c.ca...@j-davis.com

> The co settings are just everything else. 
> They are not parametric, they are just some other sort order that 
> someone spelled out explicitly.

This sounds like another case where we can't really tell the user why
they would want to use a specific "co" setting; they should only use it
if they already know they want it. Is there some way we can word that
in the documentation so that people don't misuse them?

For instance, one of them is called "emoji". I'm sure a lot of
applications use emoji (or at least might encounter them), should they
always use co-emoji, or would some people who are using emoji not want
it? Can it be combined with "ks" or other "k*" settings?

What I'm trying to avoid is users seeing something in the documentation
and using it without it really being a good fit for their problem. Then
they see something unexpected, and need to rebuild all of their indexes
or something.

> > * I don't understand what "kc" means if "ks" is not set to
> > "level1".
> 
> There is an example here: 
> https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings#colcaselevel

Interesting, thank you.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-22 Thread Jeff Davis

On Thu, 2023-05-11 at 13:07 +0200, Peter Eisentraut wrote:
> Here is my proposed patch for this.

The commit message makes it sound like lc_collate/ctype are completely
obsolete, and I don't think that's quite right: they still represent
the server environment, which does still matter in some cases.

I'd just say that they are too confusing (likely to be misused), and
becoming obsolete (or less relevant), or something along those lines.

Otherwise, this is fine with me. I didn't do a detailed review because
it's just mechanical.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-22 Thread Jeff Davis

On Thu, 2023-05-11 at 13:09 +0200, Peter Eisentraut wrote:
> There is also the deterministic flag and the icurules setting. 
> Depending on what level of detail you imagine the user needs, you
> really 
> do need to look at the whole picture, not some subset of it.

(Nit: all database default collations are deterministic.)

I agree, but I think there should be a way to see the whole picture in
one command. If nothing else, for repro cases sent to the list, it
would be nice to have a single line like:

   SELECT show_default_collation_whole_picture();

Right now it involves some back and forth, like checking
datlocprovider, then looking in the right fields and ignoring the wrong
ones.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-22 Thread Peter Eisentraut


On 18.05.23 00:59, Jeff Davis wrote:

On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:

Other than that, and I took your suggestions almost verbatim. Patch
attached. Thank you!


Attached new patch with a typo fix and a few other edits. I plan to
commit soon.


Some small follow-up on this patch:

Please put blank lines between




etc., matching existing style.

We usually don't capitalize the collation parameters like

CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP);

elsewhere in the documentation.

Table 24.2. ICU Collation Settings should probably be sorted by key, or 
at least by something.


All tables should referenced in the text, like "Table x.y shows this and 
that."  (Note that a table could float to a different page in some 
output formats, so just putting it into a section without some 
introductory text isn't sound.)


Table 24.1. ICU Collation Levels shows punctuation as level 4, which is 
only true in shifted mode, which isn't the default.  The whole business 
of treating variable collation elements is getting a bit lost in this 
description.  The kv option is described as "Classes of characters 
ignored during comparison at level 3.", which is effectively true but 
not the whole picture.

Re: Order changes in PG16 since ICU introduction

2023-05-22 Thread Peter Eisentraut


On 18.05.23 19:55, Jeff Davis wrote:

On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:

I did a quicker read through this time. LGTM overall. I like what you
did with the explanations around sensitivity (now it makes sense).


Committed, thank you.

There are a few things I don't understand that would be good to
document better:

* Rules. I still don't quite understand the use case: are these for
people inventing new languages? What is a plausible use case that isn't
covered by the existing locales and collation settings? Do rules make
sense for a database default collation? Are they for language experts
only or might an ordinary developer benefit from using them?


The rules are for setting whatever sort order you like.  Maybe you want 
to sort + before - or whatever.  It's like, if you don't like it, build 
your own.



* The collation types "phonebk", "emoji", etc.: are these variants of
particular locales, or do they make sense in multiple locales? I don't
know where they fit in or how to document them.


The k* settings are parametric settings, in that they transform the sort 
key in some algorithmic way.  The co settings are just everything else. 
They are not parametric, they are just some other sort order that 
someone spelled out explicitly.



* I don't understand what "kc" means if "ks" is not set to "level1".


There is an example here: 
https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings#colcaselevel

Re: Order changes in PG16 since ICU introduction

2023-05-20 Thread Jeff Davis

On Fri, 2023-05-19 at 21:13 +0200, Daniel Verite wrote:
> ISTM that if we want to go that route, we need the make the minimum
> changes at the user interface level and not any deeper, so that when
> (locale="C" OR locale="POSIX") AND the provider has not been
> specified,
> then the command (initdb and create database) act as if the user had
> specified provider=libc.

If we special case locale=C, but do nothing for locale=fr_FR, then I'm
not sure we've solved the problem. Andrew Gierth raised the issue here,
which he called "maximally confusing":

https://postgr.es/m/874jp9f5jo@news-spur.riddles.org.uk

That's why I feel that we need to make locale apply to whatever the
provider is, not just when it happens to be C.

> > (3) Support iculocale=C in the ICU provider using the memcmp()
> > path.
> 
> > In other words, if provider=icu and iculocale=C, lc_collate_is_c()
> > and
> > lc_ctpye_is_c() would both return true.
> 
> ICU does not provide a locale that behaves like that, and it doesn't
> feel right to pretend it does. It feels like attacking the problem
> at the wrong level.

I agree that #3 feels slightly wrong, but I think it's still a viable
option until we have consensus on something better.

> > (4) Create a new "none" provider (which has no locale and always
> > memcmp
> > semantics), and automatically change the provider to "none" if
> > provider=icu and iculocale=C.
> 
> It still uses libc/C for character classification and case changing,
> so "no locale" is technically not true.

The provider affects callers that have a pg_locale_t, such as the SQL-
callable lower() function. For those callers, the "none" provider uses
pg_ascii_tolower(), etc., not libc. That's why I called it "none" --
it's using simple internal postgres implementations instead of a
provider.

For callers that don't have a pg_locale_t, they may call libc functions
directly and rely on the server environment. But in those cases,
there's no way to set a provider at all, it's just relying on the
server environment. There aren't many of these cases, and hopefully we
can eliminate the reliance on the server environment over time.

If I'm missing something, let me know what cases you have in mind.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-19 Thread Tom Lane

Jeff Davis  writes:
> Committed, thank you.

This commit has given the PDF docs build some indigestion:

Making portrait pages on A4 paper (210mmx297mm)
/home/postgres/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting with 
"Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. Substituting 
with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Hyphenation pattern not found. URI: en.
[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area 
in the inline-progression direction by 3531 millipoints. (See position 
55117:2388)
[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area 
in the inline-progression direction by 1871 millipoints. (See position 
55117:12998)
[WARN] FOUserAgent - Glyph "?" (0x323, dotbelowcmb) not available in font 
"Courier".
[WARN] FOUserAgent - Glyph "?" (0x302, circumflexcmb) not available in font 
"Courier".
[WARN] FOUserAgent - The contents of fo:block line 12 exceed the available area 
in the inline-progression direction by 20182 millipoints. (See position 
55172:188)
[WARN] FOUserAgent - The contents of fo:block line 10 exceed the available area 
in the inline-progression direction by 17682 millipoints. (See position 
55172:188)
[WARN] FOUserAgent - Glyph "?" (0x142, lslash) not available in font 
"Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value found 
on the parent FO.

(The first three and last one warnings are things we've been living
with, but the ones between are new.)

The first two "exceed the available area" complaints are in the "ICU
Collation Levels" table.  We can silence them by providing some column
width hints to make the "Description" column a tad wider than the rest,
as in the proposed patch attached.  The other two, as well as the first
two glyph-not-available complaints, are caused by this bit:

   Full normalization is important in some cases, such as when
   multiple accents are applied to a single character. For instance,
   'ệ' can be composed of code points
   U&'\0065\0323\0302' or
   U&'\0065\0302\0323'. With full normalization
   on, these code point sequences are treated as equal; otherwise they
   are unequal.

which renders just abysmally (see attached screen shot), and even in HTML
where it's rendering about as intended, it really is an unintelligible
example.  Few people are going to understand that the circumflex and the
dot-below are separately applied accents.  How about we drop the explicit
example and write something like

   Full normalization allows code point sequences such as
   characters with multiple accent marks applied in different
   orders to be seen as equal.

?

(The last missing-glyph complaint is evidently from the release notes;
I'll bug Bruce about that separately.)

regards, tom lane

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 9db14649aa..96a23bf530 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1140,6 +1140,14 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
  
   ICU Collation Levels
   
+   
+   
+   
+   
+   
+   
+   
+   

 
  Level

Re: Order changes in PG16 since ICU introduction

2023-05-19 Thread Daniel Verite

Jeff Davis wrote:

> 2) Automatically change the provider to libc when locale=C.
> 
> Almost works, but it's not clear how we handle the case "provider=icu
> lc_collate='fr_FR.utf8' locale=C".
> 
> If we change it to "provider=libc lc_collate=C", we've overridden the
> specified lc_collate. If we ignore the locale=C, that would be
> surprising to users. If we throw an error, that would be a backwards
> compatibility issue.

This thread started with a report illustrating that when users mention
the locale "C", they implicitly mean "C" from the libc provider, as
when libc was the default. The problem is that as soon as ICU is the
default, any reference to a libc collation should mention explicitly
that the provider is libc.

It seems what we're set on the idea to create an exception for "C"
(and I assume also "POSIX") to avoid too much confusion, and because
"C" is quite special anyway, and has no equivalent in ICU (the switch
in v16 to ICU as the default provider is based on the premise that the
locales with the same name will behave pretty much the same with ICU
as they did with libc, but it's absolutely not the case with "C").

ISTM that if we want to go that route, we need the make the minimum
changes at the user interface level and not any deeper, so that when
(locale="C" OR locale="POSIX") AND the provider has not been specified,
then the command (initdb and create database) act as if the user had
specified provider=libc.

> (3) Support iculocale=C in the ICU provider using the memcmp() path.

> In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
> lc_ctpye_is_c() would both return true.

ICU does not provide a locale that behaves like that, and it doesn't
feel right to pretend it does. It feels like attacking the problem
at the wrong level.

> (4) Create a new "none" provider (which has no locale and always memcmp
> semantics), and automatically change the provider to "none" if
> provider=icu and iculocale=C.

It still uses libc/C for character classification and case changing,
so "no locale" is technically not true. Personally I don't see
the benefit of adding a "none" provider. C is a libc locale
and libc is not disappearing. I also think  that when users explicitly
indicate provider=icu, they should get icu.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

2023-05-18 Thread Jeff Davis

On Thu, 2023-05-18 at 20:11 +0200, Matthias van de Meent wrote:
> As I complain about in [0], since 5cd1a5af --no-locale has been
> broken
> / bahiving outside it's description: Instead of being equivalent to
> `--locale=C` it now also overrides `--locale-provider=libc`,
> resulting
> in undocumented behaviour.

I agree that 5cd1a5af is incomplete.

Posting updated patches. Feedback on the approaches below would be
appreciated.

For context, in version 15:

  $ initdb -D data --locale-provider=icu --icu-locale=en
  => create database clocale template template0 locale='C';
  => select datname, datlocprovider, daticulocale
 from pg_database where datname='clocale';
   datname | datlocprovider | daticulocale 
  -++--
   clocale | i  | en
  (1 row)

That behavior is confusing, and when I made ICU the default provider in
v16, the confusion was extended into more cases.

If we leave the CREATE DATABASE (and createdb and initdb) syntax in
place, such that LOCALE (and --locale) do not apply to ICU at all, then
I don't see a path to a good ICU user experience.

Therefore I conclude that we need LOCALE (and --locale) to apply to ICU
somehow. (The LOCALE option already applies to ICU during CREATE
COLLATION, just not CREATE DATABASE or initdb.)

Patch 0003 does this. It's fairly straightforward and I believe we need
this patch.

But to actually fix your complaint we also need --no-locale to be
equivalent to --locale=C and for those options to both use memcmp()
semantics. There are several approaches to accomplish this, and I think
this is the part where I most need some feedback. There are only so
many approaches, and each one has some potential downsides, but I
believe we need to select one:

(1) Give up and leave the existing CREATE DATABASE (and createdb, and
initdb) semantics in place, along with the confusing behavior in v15.

This is a last resort, in my opinion. It gives us no path toward a good
user experience with ICU, and leaves us with all of the problems of the
OS as a collation provider.

(2) Automatically change the provider to libc when locale=C.

Almost works, but it's not clear how we handle the case "provider=icu
lc_collate='fr_FR.utf8' locale=C".

If we change it to "provider=libc lc_collate=C", we've overridden the
specified lc_collate. If we ignore the locale=C, that would be
surprising to users. If we throw an error, that would be a backwards
compatibility issue.

One possible solution would be to change the catalog representation to
allow setting the default collation locale separately from datcollate
even for the libc provider. For instance, rename daticulocale to
datdeflocale, and store the default collation locale there for both
libc and ICU. Then, "provider=icu lc_collate='fr_FR.utf8' locale=C"
could be changed into "provider=libc lc_collate='fr_FR.utf8'
deflocale=C". It may be confusing that datcollate is a different
concept from datdeflocale; but then again they are different concepts
and it's confusing that they are currently combined into one.

(3) Support iculocale=C in the ICU provider using the memcmp() path.

In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
lc_ctpye_is_c() would both return true.

There's a potential problem for users who've misused ICU in the past
(15 or earlier) by using provider=icu and iculocale=C. ICU would accept
such a locale name, but not recognize it and fall back to the root
locale, so it never worked as the user intended it. But if we redefine
C to be memcmp(), then such users will have broken indexes if they
upgrade.

We could add a check at pg_upgrade time for iculocale=C in versions 15
and earlier, and cause the check (and therefore the upgrade) to fail.
That may be reasonable considering that it never really worked in the
past, and perhaps very few users actually ever created such a
collation. But if some user runs into that problem, we'd have to resort
to a hack like telling them to "update pg_collation set iculocale='und'
where iculocale='C'" and then try the upgrade again, which is not a
great answer (as far as I can tell it would be a correct answer and
should not break their indexes, but it feels pretty dangerous).

There may be some other resolutions to this problem, such as catalog
hacks that allow for different representations of iculocale=C pre-16
and post-16. That doesn't sound great though, and we'd have to figure
out what to do with pg_dump.

(4) Create a new "none" provider (which has no locale and always memcmp
semantics), and automatically change the provider to "none" if
provider=icu and iculocale=C.

This solves the problem case in #2 and the potential upgrade problem in
#3. It also makes the documentation a bit more natural, in my opinion,
even if we retain the special case for provider=libc collate=C.

#4 is the approach I chose (patches 0001 and 0002), but I'd like to
hear what others think.

For historical reasons, users may assume that LC_COLLATE control

Re: Order changes in PG16 since ICU introduction

2023-05-18 Thread Jeff Davis

On Thu, 2023-05-18 at 13:58 -0400, Jonathan S. Katz wrote:
>  From my read of them, as an app developer I'd be very unlikely to
> use 
> this. Maybe there is something with building out some collation rules
> vis-a-vis an extension, but I have trouble imagining the use-case. I
> may 
> also not be the target audience for this feature.

That's a problem for the ICU rules feature. I understand some features
may be for domain experts only, but we at least need to call that out
so that ordinary developers don't get confused. And we should hear from
some of those domain experts that they actually want it and it solves a
real problem.

For the features that can be described with collation
settings/attributes right in the locale name, the use cases are more
plausible and we've supported them since v10, so it's good to document
them as best we can. It's hard to expose only the particular ICU
collation settings we understand best (e.g. the "ks" setting that
allows case insensitive collation), so it's inevitable that there will
be some settings that are more obscure and harder to document.

But in the case of ICU rules, they are newly-supported in 16, so there
should be a clear reason we're adding them. Otherwise we're just
setting up users for confusion or problems, and creating backwards-
compatibility headaches for ourselves (and the last thing we want is to
fret over backwards compatibility for a feature with no users).

Beyond that, there seems to be some danger: if the syntax for rules is
not perfectly compatible between ICU versions, the user might run into
big problems.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-18 Thread Matthias van de Meent

On Fri, 21 Apr 2023 at 22:46, Jeff Davis  wrote:
>
> On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote:
> > > > > >
> > Also, somewhere along the line someone broke initdb --no-locale,
> > which
> > should result in C locale being the default everywhere, but when I
> > just
> > tested it it picked 'en' for an ICU locale, which is not the right
> > thing.
>
> Fixed, thank you.

As I complain about in [0], since 5cd1a5af --no-locale has been broken
/ bahiving outside it's description: Instead of being equivalent to
`--locale=C` it now also overrides `--locale-provider=libc`, resulting
in undocumented behaviour.

Kind regards,

Matthias van de Meent
Neon, Inc.

[0] 
https://www.postgresql.org/message-id/CAEze2WiZFQyyb-DcKwayUmE4rY42Bo6kuK9nBjvqRHYxUYJ-DA%40mail.gmail.com

Re: Order changes in PG16 since ICU introduction

2023-05-18 Thread Jonathan S. Katz


On 5/18/23 1:55 PM, Jeff Davis wrote:

On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:

I did a quicker read through this time. LGTM overall. I like what you
did with the explanations around sensitivity (now it makes sense).


Committed, thank you.


\o/


There are a few things I don't understand that would be good to
document better:

* Rules. I still don't quite understand the use case: are these for
people inventing new languages? What is a plausible use case that isn't
covered by the existing locales and collation settings? Do rules make
sense for a database default collation? Are they for language experts
only or might an ordinary developer benefit from using them?


From my read of them, as an app developer I'd be very unlikely to use 
this. Maybe there is something with building out some collation rules 
vis-a-vis an extension, but I have trouble imagining the use-case. I may 
also not be the target audience for this feature.



* The collation types "phonebk", "emoji", etc.: are these variants of
particular locales, or do they make sense in multiple locales? I don't
know where they fit in or how to document them.


I remember I had a exploratory use case for "phonebk" but I couldn't 
figure out how to get it to work. AIUI from random searching, the idea 
is that it provides the "phonebook" rules for ordering "names" in a 
particular locale, but I couldn't get it to work.



* I don't understand what "kc" means if "ks" is not set to "level1".


Me neither, but I haven't stared at this as hard as others.

Thanks,

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature

Re: Order changes in PG16 since ICU introduction

2023-05-18 Thread Jeff Davis

On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:
> I did a quicker read through this time. LGTM overall. I like what you
> did with the explanations around sensitivity (now it makes sense).

Committed, thank you.

There are a few things I don't understand that would be good to
document better:

* Rules. I still don't quite understand the use case: are these for
people inventing new languages? What is a plausible use case that isn't
covered by the existing locales and collation settings? Do rules make
sense for a database default collation? Are they for language experts
only or might an ordinary developer benefit from using them?

* The collation types "phonebk", "emoji", etc.: are these variants of
particular locales, or do they make sense in multiple locales? I don't
know where they fit in or how to document them.

* I don't understand what "kc" means if "ks" is not set to "level1".

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-17 Thread Jonathan S. Katz


On 5/17/23 6:59 PM, Jeff Davis wrote:

On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:

Other than that, and I took your suggestions almost verbatim. Patch
attached. Thank you!


Attached new patch with a typo fix and a few other edits. I plan to
commit soon.


I did a quicker read through this time. LGTM overall. I like what you 
did with the explanations around sensitivity (now it makes sense).


Thanks,

Jonathan


OpenPGP_signature
Description: OpenPGP digital signature

Re: Order changes in PG16 since ICU introduction

2023-05-17 Thread Jeff Davis

On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:
> Other than that, and I took your suggestions almost verbatim. Patch
> attached. Thank you!

Attached new patch with a typo fix and a few other edits. I plan to
commit soon.

Regards,
Jeff Davis

From d0d2375fa55618b60f361f6bb64b2c49490125b9 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 27 Apr 2023 14:43:46 -0700
Subject: [PATCH] Doc improvements for language tags and custom ICU collations.

Separate the documentation for language tags themselves from the
available collation settings which can be included in a language tag.

Include tables of the available options, more details about the
effects of each option, and additional examples.

Also include an explanation of the "levels" of textual features and
how they relate to collation.

Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1...@postgresql.org
Reviewed-by: Jonathan S. Katz
---
 doc/src/sgml/charset.sgml | 683 +++---
 1 file changed, 562 insertions(+), 121 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b8966..6b9c323edd 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en
 variants and customization options.

   
+  
+   ICU Locales
+   
+ICU Locale Names
+
+ The ICU format for the locale name is a Language Tag.
+
+
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP);
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+
+
+   
+   
+Locale Canonicalization and Validation
+
+ When defining a new ICU collation object or database with ICU as the
+ provider, the given locale name is transformed ("canonicalized") into a
+ language tag if not already in that form. For instance,
+
+
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE:  using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE:  using standard form "de-DE" for locale "de_DE.utf8"
+
+
+ If you see this notice, ensure that the PROVIDER and
+ LOCALE are the expected result. For consistent results
+ when using the ICU provider, specify the canonical language tag instead of relying on the
+ transformation.
+
+
+ A locale with no language name, or the special language name
+ root, is transformed to have the language
+ und ("undefined").
+
+
+ ICU can transform most libc locale names, as well as some other formats,
+ into language tags for easier transition to ICU. If a libc locale name is
+ used in ICU, it may not have precisely the same behavior as in libc.
+
+
+ If there is a problem interpreting the locale name, or if the locale name
+ represents a language or region that ICU does not recognize, you will see
+ the following warning:
+
+
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+WARNING:  ICU locale "nonsense" has unknown language "nonsense"
+HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION
+
+
+  controls how the message is
+ reported. Unless set to ERROR, the collation will
+ still be created, but the behavior may not be what the user intended.
+
+   
+   
+Language Tag
+
+ A language tag, defined in BCP 47, is a standardized identifier used to
+ identify languages, regions, and other information about a locale.
+
+
+ Basic language tags are simply
+ language-region;
+ or even just language. The
+ language is a language code
+ (e.g. fr for French), and
+ region is a region code
+ (e.g. CA for Canada). Examples:
+ ja-JP, de, or
+ fr-CA.
+
+
+ Collation settings may be included in the language tag to customize
+ collation behavior. ICU allows extensive customization, such as
+ sensitivity (or insensitivity) to accents, case, and punctuation;
+ treatment of digits within text; and many other options to satisfy a
+ variety of uses.
+
+
+ To include this additional collation information in a language tag,
+ append -u, which indicates there are additional
+ collation settings, followed by one or more
+ -key-value
+ pairs. The key is the key for a collation setting and
+ value is a valid value for that setting. For
+ boolean settings, the -key
+ may be specified without a corresponding
+ -value, which implies a
+ value of true.
+
+
+ For example, the language tag en-US-u-kn-ks-level2
+ means the locale with the English language in the US region, with
+ collation settings kn set to true
+ and ks set to level2. Those
+ settings mean the collation will be case-insensitive and treat a sequence
+ of digits as a single number:
 
+
+CREATE COLL

Re: Order changes in PG16 since ICU introduction

2023-05-16 Thread Jeff Davis

On Tue, 2023-05-16 at 15:35 -0400, Jonathan S. Katz wrote:
> +  Sensitivity when determining equality, with
> +  level1 the least sensitive and
> +  identic the most sensitive. See  +  linkend="icu-collation-levels"/> for details.
> 
> This discusses equality sensitivity, but I'm not sure if I understand
> that term here. The ICU docs seem to call these "strengths"[1], maybe
> we 
> use that term to be consistent with upstream?

"Sensitivity" comes from "case sensitivity" which is more clear to me
than "strength". I added the term "strength" to correspond to the
unicode terminology, but I kept sensitivity and I tried to make it
slightly more clear.

Other than that, and I took your suggestions almost verbatim. Patch
attached. Thank you!

I also made a few other changes:

  * added paragraph transformation of '' or 'root' to the 'und'
language (root collation)
  * added paragraph that the "identic" level still performs some basic
normalization
  * added example for when full normalization matters

I should also say that I don't really understand the case when "kc" is
set to true and "ks" is level 2 or higher. If someone has an example of
where that matters, let me know.

Regards,
Jeff Davis

From 8633ec205b0b0297910cef8f931092d0c05eb3ce Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 27 Apr 2023 14:43:46 -0700
Subject: [PATCH v2] Doc improvements for language tags and custom ICU
 collations.

Separate the documentation for language tags themselves from the
available collation settings which can be included in a language tag.

Include tables of the available options, more details about the
effects of each option, and additional examples.

Also include an explanation of the "levels" of textual features and
how they relate to collation.

Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1...@postgresql.org
Reviewed-by: Jonathan S. Katz
---
 doc/src/sgml/charset.sgml | 680 +++---
 1 file changed, 559 insertions(+), 121 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b8966..ea43732ec9 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en
 variants and customization options.

   
+  
+   ICU Locales
+   
+ICU Locale Names
+
+ The ICU format for the locale name is a Language Tag.
+
+
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP);
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+
+
+   
+   
+Locale Canonicalization and Validation
+
+ When defining a new ICU collation object or database with ICU as the
+ provider, the given locale name is transformed ("canonicalized") into a
+ language tag if not already in that form. For instance,
+
+
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE:  using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE:  using standard form "de-DE" for locale "de_DE.utf8"
+
+
+ If you see this notice, ensure that the PROVIDER and
+ LOCALE are the expected result. For consistent results
+ when using the ICU provider, specify the canonical language tag instead of relying on the
+ transformation.
+
+
+ A locale with no language name, or the special language name
+ root, is transformed to have the language
+ und ("undefined").
+
+
+ ICU can transform most libc locale names, as well as some other formats,
+ into language tags for easier transition to ICU. If a libc locale name is
+ used in ICU, it may not have precisely the same behavior as in libc.
+
+
+ If there is a problem interpreting the locale name, or if the locale name
+ represents a language or region that ICU does not recognize, you will see
+ the following error:
+
+
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+ERROR:  ICU locale "nonsense" has unknown language "nonsense"
+HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+
+
+  controls how the message is
+ reported. If set below ERROR, the collation will still
+ be created, but the behavior may not be what the user intended.
+
+   
+   
+Language Tag
+
+ A language tag, defined in BCP 47, is a standardized identifier used to
+ identify languages, regions, and other information about a locale.
+
+
+ Basic language tags are simply
+ language-region;
+ or even just language. The
+ language is a language code
+ (e.g. fr for French), and
+ region is a region code
+ (e.g. CA for Canada). Examples:
+ ja-JP, de, or
+ fr-CA.
+
+
+ Collation settings may be included in the language tag to customize
+ collation behavior. ICU allows extensive customization, s

Re: Order changes in PG16 since ICU introduction

2023-05-16 Thread Jonathan S. Katz

On 5/5/23 8:25 PM, Jeff Davis wrote:

On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:

On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis wrote:

Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try
to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.

It's still early, so we have plenty of time to revert the initdb
default if we need to.

That's fair enough, but I really think it's important that some
energy
get invested in providing adequate documentation for this stuff. Just
patching the code is not enough.

Attached a significant documentation patch.

I tried to make it comprehensive without trying to be exhaustive, and I
separated the explanation of language tags from what collation settings
you can include in a language tag, so hopefully that's more clear.

I added quite a few examples spread throughout the various sections,
and I preserved the existing examples at the end. I also left all of
the external links at the bottom for those interested enough to go
beyond what's there.

[Personal hat, not RMT]

Thanks -- this is super helpful. A bunch of these examples I had
previously had to figure out by randomly searching blog posts /
trial-and-error, so I think this will help developers get started more
quickly.

Comments (and a lot are just little nits to tighten the language)

Commit message -- typo: "documentaiton"

+ If you see such a message, ensure that the
PROVIDER and

+ LOCALE are as you expect, and consider specifying
+ directly as the canonical language tag instead of relying on the
+ transformation.
+

I'd recommend make this more prescriptive:

"If you see this notice, ensure that the PROVIDER and
LOCALE are the expected result. For consistent results
when using the ICU provider, specify the canonical linkend="icu-language-tag">language tag instead of relying on the
transformation."

+ If there is some problem interpreting the locale name, or if it
represents
+ a language or region that ICU does not recognize, a message will
be reported:

This is passive voice, consider:

"If there is a problem interpreting the locale name, or if the locale
name represents a language or region that ICU does not recognize, you'll
see the following error:"

+
+Language Tag
+

Before jumping in, I'd recommend a quick definition of what a language
tag is, e.g.:

"A language tag, defined in BCP 47, is a standardized identifier used to
identify languages in computer systems" or something similar.

(I did find a database that made it simpler to search for these, which
is one issue I've previously add, but I don't think we'd want to link to i)

+ To include this additional collation information in a language tag,
+ append -u, followed by one or more

My first question was "what's special about '-u'", so maybe we say:

"To include this additional collation information in a language tag,
append -u, which indicates there are additional
collation settings, followed by one or more..."

+ ICU locales are specified as a linkend="icu-language-tag">Language
+ Tag, but can also accept most libc-style locale names
(which will

+ be transformed into language tags if possible).
+

I'd recommend removing the parantheticals:

ICU locales are specified as a BCP 47 linkend="icu-language-tag">Language
Tag, but can also accept most libc-style locale names. If
possible, libc-style locale names are transformed into language tags.

+ ICU Collation Levels

Nothing to add here other than to say I'm extremely appreciative of this
section. Once upon a time I sunk a lot of time trying to figure out how
all of these levels worked.

+ Sensitivity when determining equality, with
+ level1 the least sensitive and
+ identic the most sensitive. See for details.

This discusses equality sensitivity, but I'm not sure if I understand
that term here. The ICU docs seem to call these "strengths"[1], maybe we
use that term to be consistent with upstream?

+ If set to upper, upper case sorts before lower
+ case. If set to lower, lower case sorts before
+ upper case. If set to false, it depends on the
+ locale.

Suggestion to tighten this up:

"If set to false, the sort depends on the rules of
the locale."

+ Defaults may depend on locale. The above table is not meant to be
+ complete. See for additinal
+ options and details.

Typo: additinal => "additional"

I didn't add additional documentation for ICU rules. There are so many
options for collations that it's hard for me to think of realistic
examples to specify the rules directly, unless someone wants to invent
a new language. Perhaps useful if working with an interes

Re: Order changes in PG16 since ICU introduction

2023-05-16 Thread Jeff Davis

On Tue, 2023-05-16 at 19:00 +0300, Alexander Lakhin wrote:
> I'm not sure about the proposed change in icu_from_uchar(). It seems
> that
> len_result + 1 bytes should always be enough for the result string
> terminated
> with NUL. If that's not true (we want to protect from some ICU bug
> here),
> then the change should be backpatched?

I believe it's enough and I'm not aware of any bug there so no backport
is required.

I added checks in places that were (a) checking for U_FAILURE; and (b)
expecting the result to be NUL-terminated. That's mostly callers of
uloc_getLanguage(), where I was not quite paranoid enough.

There were a couple other places though, and I went ahead and added
checks there out of paranoia, too. One was ucnv_fromUChars(), and the
other was uloc_canonicalize().

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-16 Thread Alexander Lakhin


Hi Jeff,

16.05.2023 00:03, Jeff Davis wrote:

On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote:

On the current master (after 455f948b0, and before f7faa9976, of
course)
I get an ASAN-detected failure with the following query:
CREATE COLLATION col (provider = icu, locale = '123456789012');


Thank you for the report!

ICU source specifically says:

   /**
* Useful constant for the maximum size of the language
  part of a locale ID.
* (including the terminating NULL).
* @stable ICU 2.0
*/
   #define ULOC_LANG_CAPACITY 12

So the fact that it returning success without nul-terminating the
result is an ICU bug in my opinion, and I reported it here:

https://unicode-org.atlassian.net/browse/ICU-22394

Unfortunately that means we need to be a bit more paranoid and always
check for that warning, even if we have a preallocated buffer of the
"correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING
and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially
scary), unless we check for those errors each time and report specific
errors for them.

Another approach is to always check the length and loop using dynamic
allocation so that we never overflow (and forget about any static
buffers). That seems like overkill given that the problem case is
obviously invalid input; I think as long as we catch it and throw an
ERROR it's fine. But I can do this if others think it's worthwhile.

Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING
in a few places and turns it into an ERROR. It also cleans up the loop
for uloc_getLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING
rather than (U_SUCCESS(status) && len >= buflen).


I'm not sure about the proposed change in icu_from_uchar(). It seems that
len_result + 1 bytes should always be enough for the result string terminated
with NUL. If that's not true (we want to protect from some ICU bug here),
then the change should be backpatched?

Best regards,
Alexander

Re: Order changes in PG16 since ICU introduction

2023-05-15 Thread Jeff Davis

On Mon, 2023-05-08 at 14:59 -0700, Jeff Davis wrote:
> The easiest thing to do is revert it for now, and after we sort out
> the
> memcmp() path for the ICU provider, then I can commit it again (after
> that point it would just be code cleanup and should have no
> functional
> impact).

The conversion won't be entirely dead code even after we handle the "C"
locale with memcmp(): for a locale like "C.UTF-8", it will still be
passed to the collation provider (same as with libc), and in that case,
we should still convert that to a language tag consistently across ICU
versions.

For it to be entirely dead code, we would need to convert any locale
with language "C" (e.g. "C.UTF-8") to use the memcmp() path. I'm fine
with that, but that's not what the libc provider does today, and
perhaps we should be consistent between the two. If we do leave the
code in place, we can document that specific "en-US-u-va-posix" locale
so that it's not too surprising for users.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-15 Thread Jeff Davis

On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote:
> On the current master (after 455f948b0, and before f7faa9976, of
> course)
> I get an ASAN-detected failure with the following query:
> CREATE COLLATION col (provider = icu, locale = '123456789012');
> 

Thank you for the report!

ICU source specifically says:

  /** 
   * Useful constant for the maximum size of the language
 part of a locale ID. 
   * (including the terminating NULL).
   * @stable ICU 2.0  
   */
  #define ULOC_LANG_CAPACITY 12

So the fact that it returning success without nul-terminating the
result is an ICU bug in my opinion, and I reported it here:

https://unicode-org.atlassian.net/browse/ICU-22394

Unfortunately that means we need to be a bit more paranoid and always
check for that warning, even if we have a preallocated buffer of the
"correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING
and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially
scary), unless we check for those errors each time and report specific
errors for them.

Another approach is to always check the length and loop using dynamic
allocation so that we never overflow (and forget about any static
buffers). That seems like overkill given that the problem case is
obviously invalid input; I think as long as we catch it and throw an
ERROR it's fine. But I can do this if others think it's worthwhile.

Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING
in a few places and turns it into an ERROR. It also cleans up the loop
for uloc_getLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING
rather than (U_SUCCESS(status) && len >= buflen).

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

From 9c8e9272ca807c9f75a7b32fa3190700cc600260 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 15 May 2023 13:35:07 -0700
Subject: [PATCH] ICU: check for U_STRING_NOT_TERMINATED_WARNING.

In some cases, ICU can fail to NUL-terminate a result string even if
using an appropriately-sized static buffer. The caller must either
check for len >= buflen or U_STRING_NOT_TERMINATED_WARNING.

The specific problem is related to uloc_getLanguage(), but add the
check in other cases as well.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/2098874d-c111-41e4-9063-30bcf1352...@gmail.com
---
 src/backend/utils/adt/pg_locale.c | 29 +++--
 src/bin/initdb/initdb.c   | 15 ---
 2 files changed, 15 insertions(+), 29 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f0b6567da1..1cf93b2d20 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2468,7 +2468,7 @@ pg_ucol_open(const char *loc_str)

 		status = U_ZERO_ERROR;
 		uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-		if (U_FAILURE(status))
+		if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
 		{
 			ereport(ERROR,
 	(errmsg("could not get language from locale \"%s\": %s",
@@ -2504,7 +2504,7 @@ pg_ucol_open(const char *loc_str)
 		 * Pretend the error came from ucol_open(), for consistent error
 		 * message across ICU versions.
 		 */
-		if (U_FAILURE(status))
+		if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
 		{
 			ucol_close(collator);
 			ereport(ERROR,
@@ -2639,7 +2639,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
 	status = U_ZERO_ERROR;
 	len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
  buff_uchar, len_uchar, &status);
-	if (U_FAILURE(status))
+	if (U_FAILURE(status) ||
+		status == U_STRING_NOT_TERMINATED_WARNING)
 		ereport(ERROR,
 (errmsg("%s failed: %s", "ucnv_fromUChars",
 		u_errorName(status;
@@ -2681,7 +2682,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 	icu_locale_id = palloc(len + 1);
 	*status = U_ZERO_ERROR;
 	len = uloc_canonicalize(loc, icu_locale_id, len + 1, status);
-	if (U_FAILURE(*status))
+	if (U_FAILURE(*status) || *status == U_STRING_NOT_TERMINATED_WARNING)
 		return;

 	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
@@ -2765,7 +2766,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,

 	pfree(lower_str);
 }
-
 #endif

 /*
@@ -2789,7 +2789,7 @@ icu_language_tag(const char *loc_str, int elevel)

 	status = U_ZERO_ERROR;
 	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
+	if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
 	{
 		if (elevel > 0)
 			ereport(elevel,
@@ -2811,19 +2811,12 @@ icu_language_tag(const char *loc_str, int elevel)
 	langtag = palloc(buflen);
 	while (true)
 	{
-		int32_t		len;
-
 		status = U_ZERO_ERROR;
-		len = uloc_toLanguageTag(loc_str, langtag, buf

Re: Order changes in PG16 since ICU introduction

2023-05-15 Thread Peter Eisentraut

On 11.05.23 23:29, Jeff Davis wrote:

New patch series attached.

=== 0001: fix bug that allows creating hidden collations

Bug:
https://www.postgresql.org/message-id/051c9395cf880307865ee8b17acdbf7f838c1e39.ca...@j-davis.com

This is still being debated in the other thread. Not really related to
this thread, so I suggest dropping it from this patch series.

=== 0002: handle some kinds of libc-stlye locale strings

ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't
in later versions. Handle them in postgres for consistency.

I tend to agree with ICU that these variants are obsolete, and we don't
need to support them anymore. If this were a tiny patch, then maybe ok,
but the way it's presented here the whole code is duplicated between
pg_locale.c and initdb.c, which is not great.

=== 0003: reduce icu_validation_level to WARNING

Given that we've seen some inconsistency in which locale names are
accepted in different ICU versions, it seems best not to be too strict.
Peter Eisentraut suggested that it be set to ERROR originally, but a
WARNING should be sufficient to see problems without introducing risks
migrating to version 16.

I'm not sure why this is the conclusion. Presumably, the detection
capabilities of ICU improve over time, so we want to take advantage of
that? What are some example scenarios where this change would help?

=== 0004-0006:

To solve the issues that have come up in this thread, we need CREATE
DATABASE (and createdb and initdb) to use LOCALE to mean the collation
locale regardless of which provider is in use (which is what 0006
does).

0006 depends on ICU handling libc locale names. It already does a good
job for most libc locale names (though patch 0002 fixes a few cases
where it doesn't). There may be more cases, but for the most part libc
names are interpreted in a reasonable way. But one important case is
missing: ICU does not handle the "C" locale as we expect (that is,
using memcmp()).

We've already allowed users to create ICU collations with the C locale
in the past, which uses the root collation (not memcmp()), and we need
to keep supporting that for upgraded clusters.

I'm not sure I agree that we need to keep supporting that. The only way
you could get that in past releases is if you specify explicitly, "give
me provider ICU and locale C", and then it wouldn't actually even work
correctly. So nobody should be using that in practice, and nobody
should have stumbled into that combination of settings by accident.

3. Introduce collation provider "none", which is always memcmp-based
(patch 0004). It's equivalent to the libc locale=C, but it allows
specifying the LC_COLLATE and LC_CTYPE independently. A command like
CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C'
LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none"
(patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For
the database default collation, that would always use memcmp(), but the
server environment LC_COLLATE would be set to en_US as the user
specified.

This seems most attractive, but I think it's quite invasive at this
point, especially given the dubious premise (see above).

=== 0007: Add a GUC to control the default collation provider

Having a GUC would make it easier to migrate to ICU without surprises.
This only affects the default for CREATE COLLATION, not CREATE DATABASE
(and obviously not initdb).

It's not clear to me why we would want that. Also not clear why it
should only affect CREATE COLLATION.

Re: Order changes in PG16 since ICU introduction

2023-05-13 Thread Alexander Lakhin


Hello Jeff,

09.05.2023 00:59, Jeff Davis wrote:

The easiest thing to do is revert it for now, and after we sort out the
memcmp() path for the ICU provider, then I can commit it again (after
that point it would just be code cleanup and should have no functional
impact).


On the current master (after 455f948b0, and before f7faa9976, of course)
I get an ASAN-detected failure with the following query:
CREATE COLLATION col (provider = icu, locale = '123456789012');

==2929883==ERROR: AddressSanitizer: stack-buffer-overflow on address 
0x7ffc491be09c at pc 0x556e8571a260 bp 0x7
ffc491be020 sp 0x7ffc491bd7c8
READ of size 15 at 0x7ffc491be09c thread T0
    #0 0x556e8571a25f in __interceptor_strcmp.part.0 
(.../usr/local/pgsql/bin/postgres+0x2aa025f)
    #1 0x556e86d77ee6 in icu_language_tag 
.../src/backend/utils/adt/pg_locale.c:2802
...
Address 0x7ffc491be09c is located in stack of thread T0 at offset 76 in frame
    #0 0x556e86d77cfe in icu_language_tag 
.../src/backend/utils/adt/pg_locale.c:2782

  This frame has 2 object(s):
    [48, 52) 'status' (line 2784)
    [64, 76) 'lang' (line 2785) <== Memory access at offset 76 overflows this 
variable
...

Here, uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status) returns
status = -124, i.e.,
    U_STRING_NOT_TERMINATED_WARNING = -124,/**< An output string could not be NUL-terminated because output 
length==destCapacity. */

(ULOC_LANG_CAPACITY = 12)
this value is not covered by U_FAILURE(status), and strcmp(), that follows,
goes out of the lang variable bounds.

Best regards,
Alexander

Re: Order changes in PG16 since ICU introduction

2023-05-11 Thread Jeff Davis

New patch series attached.

=== 0001: fix bug that allows creating hidden collations

Bug:
https://www.postgresql.org/message-id/051c9395cf880307865ee8b17acdbf7f838c1e39.ca...@j-davis.com

=== 0002: handle some kinds of libc-stlye locale strings

ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't
in later versions. Handle them in postgres for consistency.

=== 0003: reduce icu_validation_level to WARNING

Given that we've seen some inconsistency in which locale names are
accepted in different ICU versions, it seems best not to be too strict.
Peter Eisentraut suggested that it be set to ERROR originally, but a
WARNING should be sufficient to see problems without introducing risks
migrating to version 16.

I don't expect objections to 0003, so I may commit this soon, but I'll
give it a little time in case someone has an opinion.

=== 0004-0006: 

To solve the issues that have come up in this thread, we need CREATE
DATABASE (and createdb and initdb) to use LOCALE to mean the collation
locale regardless of which provider is in use (which is what 0006
does).

0006 depends on ICU handling libc locale names. It already does a good
job for most libc locale names (though patch 0002 fixes a few cases
where it doesn't). There may be more cases, but for the most part libc
names are interpreted in a reasonable way. But one important case is
missing: ICU does not handle the "C" locale as we expect (that is,
using memcmp()).

We've already allowed users to create ICU collations with the C locale
in the past, which uses the root collation (not memcmp()), and we need
to keep supporting that for upgraded clusters. So that leaves us with a
catalog representation problem. I mentioned upthread that we can solve
that by:

  1. Using iculocale=NULL to mean "C-as-in-memcmp", or having some
other catalog hack (like another field). That's not desirable because
the catalog representation is already complex and it may be hard for
users to tell what's happening.

  2. When provider=icu and locale=C, switch to provider=libc locale=C.
This is very messy, because currently the syntax allows specifying a
database with LOCALE_PROVIDER='icu' ICU_LOCALE='C' LC_COLLATE='en_US' -
- if the provider gets changed to libc, what would we set datcollate
to? I don't think this is workable without some breakage. We can't
simply override datcollate to be C in that case, because there are some
things other than the default collation that might need it set to en_US
as the user specified.

  3. Introduce collation provider "none", which is always memcmp-based
(patch 0004). It's equivalent to the libc locale=C, but it allows
specifying the LC_COLLATE and LC_CTYPE independently. A command like
CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C'
LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none"
(patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For
the database default collation, that would always use memcmp(), but the
server environment LC_COLLATE would be set to en_US as the user
specified.

For this patch series, I chose approach #3. I think it works out nicely
-- it provides a better place to document the "no locale" behavior
(including a warning that it depends on the database encoding), and I
think it's more clear to the user that locale=C is not actually using a
provider at all. It's more invasive, but feels like a better solution.
If others don't like it I can implement approach #1 instead.

=== 0007: Add a GUC to control the default collation provider

Having a GUC would make it easier to migrate to ICU without surprises.
This only affects the default for CREATE COLLATION, not CREATE DATABASE
(and obviously not initdb).


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From fc66f02976bb11b629bcf71346c2858eccbcf1a3 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 11 May 2023 10:36:04 -0700
Subject: [PATCH v5 1/7] For user-defined collations, never set
 collencoding=-1.

For new user-defined collations, always set collencoding to the
current database encoding so that it is never shadowed by a built-in
collation.

Built in collations that work with any encoding may have
collencoding=-1, and if a user defines a collation with the same name,
it will shadow the built-in collation.

Previously it was possible to create an ICU collation (which was
assigned collencoding=-1) that was shadowed by a built-in collation
and completely inaccessible.
---
 src/backend/commands/collationcmds.c  | 28 +--
 .../regress/expected/collate.icu.utf8.out |  2 +-
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c91fe66d9b..a53700256b 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -302,16 +302,29 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 	(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 	 e

Re: Order changes in PG16 since ICU introduction

2023-05-11 Thread Peter Eisentraut


On 09.05.23 17:09, Jeff Davis wrote:

It's awkward for a user to read pg_database.datlocprovider, then
depending on that, either look in datcollate or daticulocale. (It's
awkward in the code, too.)

Maybe some built-in function that returns a tuple of the default
provider, the locale, and the version? Or should we also output the
ctype somehow (which affects the results of upper()/lower())?


There is also the deterministic flag and the icurules setting. 
Depending on what level of detail you imagine the user needs, you really 
do need to look at the whole picture, not some subset of it.

Re: Order changes in PG16 since ICU introduction

2023-05-11 Thread Peter Eisentraut


On 09.05.23 10:25, Alvaro Herrera wrote:

On 2023-Apr-24, Peter Eisentraut wrote:


The GUC settings lc_collate and lc_ctype are from a time when those locale
settings were cluster-global.  When we made those locale settings
per-database (PG 8.4), we kept them as read-only.  As of PG 15, you can use
ICU as the per-database locale provider, so what is being attempted in the
above example is already meaningless before PG 16, since you need to look
into pg_database to find out what is really happening.

I think we should just remove the GUC parameters lc_collate and lc_ctype.


I agree with removing these in v16, since they are going to become more
meaningless and confusing.


Here is my proposed patch for this.
From b548a671ad02a5c851a4984db6e4535a0b70f881 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 11 May 2023 13:02:02 +0200
Subject: [PATCH] Remove read-only server settings lc_collate and lc_ctype

The GUC settings lc_collate and lc_ctype are from a time when those
locale settings were cluster-global.  When those locale settings were
made per-database (PG 8.4), the settings were kept as read-only.  As
of PG 15, you can use ICU as the per-database locale provider, so
examining these settings is already meaningless, since you need to
look into pg_database to find out what is really happening.

Discussion: 
https://www.postgresql.org/message-id/696054d1-bc88-b6ab-129a-18b8bce6a...@enterprisedb.com
---
 contrib/citext/expected/citext_utf8.out   |  4 +--
 contrib/citext/expected/citext_utf8_1.out |  4 +--
 contrib/citext/sql/citext_utf8.sql|  4 +--
 doc/src/sgml/config.sgml  | 32 ---
 src/backend/utils/init/postinit.c |  4 ---
 src/backend/utils/misc/guc_tables.c   | 26 ---
 .../regress/expected/collate.icu.utf8.out |  4 +--
 .../regress/expected/collate.linux.utf8.out   |  6 ++--
 .../expected/collate.windows.win1252.out  |  6 ++--
 src/test/regress/sql/collate.icu.utf8.sql |  4 +--
 src/test/regress/sql/collate.linux.utf8.sql   |  6 ++--
 .../regress/sql/collate.windows.win1252.sql   |  6 ++--
 12 files changed, 22 insertions(+), 84 deletions(-)

diff --git a/contrib/citext/expected/citext_utf8.out 
b/contrib/citext/expected/citext_utf8.out
index 77b4586d8f..6630e09a4d 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -8,8 +8,8 @@
  * to the "tr-TR-x-icu" collation where it will succeed.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-   current_setting('lc_ctype') = 'C' OR
-   (SELECT datlocprovider='i' FROM pg_database
+   (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 
'i'
+FROM pg_database
 WHERE datname=current_database())
AS skip_test \gset
 \if :skip_test
diff --git a/contrib/citext/expected/citext_utf8_1.out 
b/contrib/citext/expected/citext_utf8_1.out
index d1e1fe1a9d..3caa7a00d4 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -8,8 +8,8 @@
  * to the "tr-TR-x-icu" collation where it will succeed.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-   current_setting('lc_ctype') = 'C' OR
-   (SELECT datlocprovider='i' FROM pg_database
+   (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 
'i'
+FROM pg_database
 WHERE datname=current_database())
AS skip_test \gset
 \if :skip_test
diff --git a/contrib/citext/sql/citext_utf8.sql 
b/contrib/citext/sql/citext_utf8.sql
index 8530c68dd7..1f51df134b 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -9,8 +9,8 @@
  */
 
 SELECT getdatabaseencoding() <> 'UTF8' OR
-   current_setting('lc_ctype') = 'C' OR
-   (SELECT datlocprovider='i' FROM pg_database
+   (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 
'i'
+FROM pg_database
 WHERE datname=current_database())
AS skip_test \gset
 \if :skip_test
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 909a3f28c7..3e9030e3d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10788,38 +10788,6 @@ Preset Options
   
  
 
- 
-  lc_collate (string)
-  
-   lc_collate configuration parameter
-  
-  
-  
-   
-Reports the locale in which sorting of textual data is done.
-See  for more information.
-This value is determined when a database is created.
-   
-  
- 
-
- 
-  lc_ctype (string)
-  
-   lc_ctype configuration parameter
-  
-  
-  
-   
-Reports the locale that determines character classifications.
-See  for more information.
-This value is determined when a database is created.
-Ordinarily this will be the same as lc_collate,
-but for special applications it might be set differently.
-   
-  
-

Re: Order changes in PG16 since ICU introduction

2023-05-09 Thread Jeff Davis

On Tue, 2023-05-09 at 10:25 +0200, Alvaro Herrera wrote:
> I agree with removing these in v16, since they are going to become
> more
> meaningless and confusing.

Agreed, but it would be nice to have an alternative that does the right
thing.

It's awkward for a user to read pg_database.datlocprovider, then
depending on that, either look in datcollate or daticulocale. (It's
awkward in the code, too.)

Maybe some built-in function that returns a tuple of the default
provider, the locale, and the version? Or should we also output the
ctype somehow (which affects the results of upper()/lower())?

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-09 Thread Alvaro Herrera

On 2023-Apr-24, Peter Eisentraut wrote:

> The GUC settings lc_collate and lc_ctype are from a time when those locale
> settings were cluster-global.  When we made those locale settings
> per-database (PG 8.4), we kept them as read-only.  As of PG 15, you can use
> ICU as the per-database locale provider, so what is being attempted in the
> above example is already meaningless before PG 16, since you need to look
> into pg_database to find out what is really happening.
> 
> I think we should just remove the GUC parameters lc_collate and lc_ctype.

I agree with removing these in v16, since they are going to become more
meaningless and confusing.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: Order changes in PG16 since ICU introduction

2023-05-08 Thread Jeff Davis

On Mon, 2023-05-08 at 17:47 -0400, Tom Lane wrote:
> -ERROR:  could not convert locale name "C" to language tag:
> U_ILLEGAL_ARGUMENT_ERROR
> +NOTICE:  using standard form "en-US-u-va-posix" for locale "C"

...

> I suppose this is environment-dependent.  Sadly, the buildfarm
> client does not show the prevailing LANG or LC_XXX settings.

Looks like it's failing-to-fail on some versions of ICU which
automatically perform that conversion.

The easiest thing to do is revert it for now, and after we sort out the
memcmp() path for the ICU provider, then I can commit it again (after
that point it would just be code cleanup and should have no functional
impact).

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-05-08 Thread Tom Lane

Jeff Davis  writes:
> === 0001: do not convert C to en-US-u-va-posix

> I plan to commit this soon.

Several buildfarm animals have failed since this went in.  The
only one showing enough info to diagnose is siskin [1]:

@@ -1043,16 +1043,15 @@
 ERROR:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to 
DISABLED.
 CREATE COLLATION testx (provider = icu, locale = 'C'); -- fails
-ERROR:  could not convert locale name "C" to language tag: 
U_ILLEGAL_ARGUMENT_ERROR
+NOTICE:  using standard form "en-US-u-va-posix" for locale "C"
 CREATE COLLATION testx (provider = icu, locale = 
'@colStrength=primary;nonsense=yes'); -- fails
 ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to 
language tag: U_ILLEGAL_ARGUMENT_ERROR
 SET icu_validation_level = WARNING;
 CREATE COLLATION testx (provider = icu, locale = 
'@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 WARNING:  could not convert locale name "@colStrength=primary;nonsense=yes" to 
language tag: U_ILLEGAL_ARGUMENT_ERROR
+ERROR:  collation "testx" already exists
 CREATE COLLATION testx (provider = icu, locale = 'C'); DROP COLLATION testx;
-WARNING:  could not convert locale name "C" to language tag: 
U_ILLEGAL_ARGUMENT_ERROR
-WARNING:  ICU locale "C" has unknown language "c"
-HINT:  To disable ICU locale validation, set parameter icu_validation_level to 
DISABLED.
+NOTICE:  using standard form "en-US-u-va-posix" for locale "C"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP 
COLLATION testx;
 WARNING:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to 
DISABLED.

I suppose this is environment-dependent.  Sadly, the buildfarm
client does not show the prevailing LANG or LC_XXX settings.

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=siskin&dt=2023-05-08%2020%3A09%3A26

Re: Order changes in PG16 since ICU introduction

2023-05-05 Thread Jeff Davis

On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
> On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis  wrote:
> > Most of the complaints seem to be complaints about v15 as well, and
> > while those complaints may be a reason to not make ICU the default,
> > they are also an argument that we should continue to learn and try
> > to
> > fix those issues because they exist in an already-released version.
> > Leaving it the default for now will help us fix those issues rather
> > than hide them.
> > 
> > It's still early, so we have plenty of time to revert the initdb
> > default if we need to.
> 
> That's fair enough, but I really think it's important that some
> energy
> get invested in providing adequate documentation for this stuff. Just
> patching the code is not enough.

Attached a significant documentation patch.

I tried to make it comprehensive without trying to be exhaustive, and I
separated the explanation of language tags from what collation settings
you can include in a language tag, so hopefully that's more clear.

I added quite a few examples spread throughout the various sections,
and I preserved the existing examples at the end. I also left all of
the external links at the bottom for those interested enough to go
beyond what's there.

I didn't add additional documentation for ICU rules. There are so many
options for collations that it's hard for me to think of realistic
examples to specify the rules directly, unless someone wants to invent
a new language. Perhaps useful if working with an interesting text file
format with special treatment for delimiters?

I asked the question about rules here:

https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at

and got some limited response about addressing sort complaints. That
sounds reasonable, but a lot of that can also be handled just by
specifying the right collation settings. Someone who understands the
use case better could add some more documentation.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From b09515bfaf5e9de330138ec4a627d02a7947de1a Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 27 Apr 2023 14:43:46 -0700
Subject: [PATCH v1] Doc improvements for language tags and custom ICU
 collations.

Separate the documentation for language tags from the documentaiton
for the available collation settings which can be included in a
language tag.

Include tables of the available options, more details about the
effects of each option, and additional examples.

Also include an explanation of the "levels" of textual features and
how they relate to collation.
---
 doc/src/sgml/charset.sgml | 656 +++---
 1 file changed, 535 insertions(+), 121 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b8966..be74064168 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,125 @@ initdb --locale-provider=icu --icu-locale=en
 variants and customization options.

   
+  
+   ICU Locales
+   
+ICU Locale Names
+
+ The ICU format for the locale name is a Language Tag.
+
+
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP);
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+
+
+   
+   
+Locale Canonicalization and Validation
+
+ When defining a new ICU collation object or database with ICU as the
+ provider, the given locale name is transformed ("canonicalized") into a
+ language tag if not already in that form. For instance,
+
+
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE:  using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE:  using standard form "de-DE" for locale "de_DE.utf8"
+
+
+ If you see such a message, ensure that the PROVIDER and
+ LOCALE are as you expect, and consider specifying
+ directly as the canonical language tag instead of relying on the
+ transformation.
+
+
+ 
+  ICU can transform most libc locale names, as well as some other formats,
+  into language tags for easier transition to ICU. If a libc locale name
+  is used in ICU, it may not have precisely the same behavior as in libc.
+ 
+
+
+ If there is some problem interpreting the locale name, or if it represents
+ a language or region that ICU does not recognize, a message will be reported:
 
+
+SET icu_validation_level = ERROR;
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+ERROR:  ICU locale "nonsense" has unknown language "nonsense"
+HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+
+
+  controls how the message is
+ reported. If set below ERROR, the collation will still
+ be created, but the behavior may not be what the user intended.
+
+   
+   
+Language Tag
+
+ Basic language tags are simply

Re: Order changes in PG16 since ICU introduction

2023-05-03 Thread Jeff Davis

On Fri, 2023-04-28 at 14:35 -0700, Jeff Davis wrote:
> On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote:
> > This should be pg_strcasecmp(...) == 0
> 
> Good catch, thank you! Fixed in updated patches.

Rebased patches.

=== 0001: do not convert C to en-US-u-va-posix

I plan to commit this soon. If someone specifies "C", they are probably
expecting memcmp()-like behavior, or some kind of error/warning that it
can't be provided.

Removing this transformation means that if you specify iculocale=C,
you'll get an error or warning (depending on icu_validation_level),
because C is not a recognized icu locale. Depending on how some of the
other issues in this thread are sorted out, we may want to relax the
validation.

=== 0002: fix @euro, etc. in ICU >= 64

I'd like to commit this soon too, but I'll wait for someone to take a
look. It makes it more reliable to map libc names to icu locale names
regardless of the ICU version.

It doesn't solve the problem for locales like "de__PHONEBOOK", but
those don't seem to be a libc format (I think just an old ICU format),
so I don't see a big reason to carry it forward. It might be another
reason to turn down the validation level to WARNING, though.

=== 0003: support C memcmp() behavior with ICU provider

The current patch 0003 has a problem, because in previous postgres
versions (going all the way back), we allowed "C" as a valid ICU
locale, that would actually be passed to ICU as a locale name. But ICU
didn't recognize it, and it would end up opening the root locale. So we
can't simply redefine "C" to mean "memcmp", because that would
potentially break indexes.

I see the following potential solutions:

   1. Represent the memcmp behavior with iculocale=NULL, or some other
catalog hack, so that we can distinguish between a locale "C" upgraded
from a previous version (which should pass "C" to ICU and get the root
locale), and a new collation defined with locale "C" (which should have
memcmp behavior). The catalog representation for locale information is
already complex, so I'm not excited about this option, but it will
work.

   2. When provider=icu and locale=C, magically transform that into
provider=libc to get memcmp-like behavior for new collations but
preserve the existing behavior for upgraded collations. Not especially
clean, but if we issue a NOTICE perhaps that would avoid confusion.

   3. Like #2, except create a new provider type "none" which may be
slightly less confusing.

=== 0004: make LOCALE apply to ICU for CREATE DATABASE

To understand this patch it helps to understand the confusing situation
with CREATE DATABASE in version 15:

The keywords LC_CTYPE and LC_COLLATE set the server environment
LC_CTYPE/LC_COLLATE for that database and can be specified regardless
of the provider. LOCALE can be specified along with (or instead of)
LC_CTYPE and LC_COLLATE, in which case whichever of LC_CTYPE or
LC_COLLATE is unspecified defaults to the setting of LOCALE. Iff the
provider is libc, LC_CTYPE and LC_COLLATE also act as the database
default collation's locale. If the provider is icu, then none of
LOCALE, LC_CTYPE, or LC_COLLATE affect the database default collation's
locale at all; that's controlled by ICU_LOCALE (which may be omitted if
the template's daticulocale is non-NULL).

The idea of patch 0004 is to address the last part, which is probably
the most confusing aspect. But for that to work smoothly, we need
something like 0003 so that LOCALE=C gives the same semantics
regardless of the provider.

Regards,
Jeff Davis

From ddda683963959a175dff17ab0e3d8519641498b9 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH v4 1/4] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.

The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.

The user can still directly specify 'en-US-u-va-posix' if desired.

Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com
---
 src/backend/utils/adt/pg_locale.c | 19 +--
 src/bin/initdb/initdb.c   | 17 +
 .../regress/expected/collate.icu.utf8.out |  8 
 src/test/regress/sql/collate.icu.utf8.sql |  4 
 4 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f0b6567da1..51b4221a39 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
 {
 #ifdef USE_ICU
 	UErrorCode	 status;
-	char		 lang[ULOC_LANG_CAPACITY];
 	char		*langtag;
 	size_t		 buflen = 32;	/* arbitrary starting buffer size */
 	const bool	 strict = true;

-	status = U_ZERO_ERROR;
-	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
-	{
-		if (elevel > 0)
-			ereport(elevel,
-	(errmsg("could not get language from locale \"

Re: Order changes in PG16 since ICU introduction

2023-04-28 Thread Jeff Davis

On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote:
> This should be pg_strcasecmp(...) == 0

Good catch, thank you! Fixed in updated patches.

> postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9
> template
> 'template0';
> ERROR:  could not convert locale name "fr_FR@euro" to language tag:
> U_ILLEGAL_ARGUMENT_ERROR

ICU 63 and earlier convert it without error to the language tag 'fr-FR-
u-cu-eur', which is correct. ICU 64 removed support for transforming
some locale variants, because apparently they think those variants are
obsolete:

https://unicode-org.atlassian.net/browse/ICU-22268
https://unicode-org.atlassian.net/browse/ICU-20187

(Aside: how obsolete are those variants?)

It's frustrating that they'd remove such transformations from the
canonicalization process.

Fortunately, it looks like it's easy enough to do the transformation
ourselves. The only problematic format is '...@VARIANT'. The other
format 'fr_FR_EURO' doesn't seem to be a valid glibc locale name[1] and
windows seems to use BCP 47[2].

And there don't seem to be a lot of variants to handle. ICU 63 only
handles 3 variants, so that's what my patch does. Any unknown variant
between 5 and 8 characters won't throw an error. There could be more
problem cases, but I'm not sure how much of a practical problem they
are.

If we try to keep the meaning of LOCALE to only LC_COLLATE and
LC_CTYPE, that will continue to be confusing for anyone that uses
provider=icu.

Regards,
Jeff Davis

[1]
https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html
[2]
https://learn.microsoft.com/en-us/windows/win32/intl/locale-names
From 6c0251c584edea64148604da52c8e55e43fe36e6 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH v3 1/4] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.

The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.

The user can still directly specify 'en-US-u-va-posix' if desired.

Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com
---
 src/backend/utils/adt/pg_locale.c | 19 +--
 src/bin/initdb/initdb.c   | 17 +
 .../regress/expected/collate.icu.utf8.out |  8 
 src/test/regress/sql/collate.icu.utf8.sql |  4 
 4 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 51df570ce9..58c4c426bc 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
 {
 #ifdef USE_ICU
 	UErrorCode	 status;
-	char		 lang[ULOC_LANG_CAPACITY];
 	char		*langtag;
 	size_t		 buflen = 32;	/* arbitrary starting buffer size */
 	const bool	 strict = true;
 
-	status = U_ZERO_ERROR;
-	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
-	{
-		if (elevel > 0)
-			ereport(elevel,
-	(errmsg("could not get language from locale \"%s\": %s",
-			loc_str, u_errorName(status;
-		return NULL;
-	}
-
-	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
-	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
-		return pstrdup("en-US-u-va-posix");
-
 	/*
 	 * A BCP47 language tag doesn't have a clearly-defined upper limit
 	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str)
 
 	/* check for special language name */
 	if (strcmp(lang, "") == 0 ||
-		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
-		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
 		found = true;
 
 	/* search for matching language within ICU */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..4086834458 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str)
 {
 #ifdef USE_ICU
 	UErrorCode	 status;
-	char		 lang[ULOC_LANG_CAPACITY];
 	char		*langtag;
 	size_t		 buflen = 32;	/* arbitrary starting buffer size */
 	const bool	 strict = true;
 
-	status = U_ZERO_ERROR;
-	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
-	{
-		pg_fatal("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status));
-		return NULL;
-	}
-
-	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
-	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
-		return pstrdup("en-US-u-va-posix");
-
 	/*
 	 * A BCP47 language tag doesn't have a clearly-defined upper limit
 	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str)
 
 	/* check for special language name */
 	if (strcmp(lang, "") == 0 ||
-		strcmp(lang, "root") == 0 || strcmp(lang, "und") ==

Re: Order changes in PG16 since ICU introduction

2023-04-27 Thread Daniel Verite

Jeff Davis wrote:

> Attached are a few small patches:
> 
>   0001: don't convert C to en-US-u-va-posix
>   0002: handle locale C the same regardless of the provider, as you
> suggest above
>   0003: make LOCALE (or --locale) apply to everything including ICU

Testing this briefly I noticed two regressions

1) all pg_collation.collversion are empty due to a trivial bug in 0002:

@ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const
char *collcollate)
 {
char   *collversion = NULL;

+   if (pg_strcasecmp("C", collcollate) ||
+   pg_strcasecmp("POSIX", collcollate))
+   return NULL;
+

This should be pg_strcasecmp(...) == 0

2) The following works with HEAD (default provider=icu) but errors out with
the patches:

postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9 template
'template0';
ERROR:  could not convert locale name "fr_FR@euro" to language tag:
U_ILLEGAL_ARGUMENT_ERROR

fr_FR@euro is a libc locale name 

$ locale -a|grep fr_FR
fr_FR
fr_FR@euro
fr_FR.iso88591
fr_FR.iso885915@euro
fr_FR.utf8

I understand that fr_FR@euro is taken as an ICU locale name, with the idea
that the locale
syntax being more or less compatible between both providers, this should work
smoothly.  0003 seems to go further in the interpretation and fail on it.
TBH the assumption that it's OK to feed libc locale names to ICU feels quite
uncomfortable.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

2023-04-25 Thread Jeff Davis

On Fri, 2023-04-21 at 22:35 +0100, Andrew Gierth wrote:
> > > > > 
> Can lc_collate_is_c() be taught to check whether an ICU locale is
> using
> POSIX collation?

Attached are a few small patches:

  0001: don't convert C to en-US-u-va-posix
  0002: handle locale C the same regardless of the provider, as you
suggest above
  0003: make LOCALE (or --locale) apply to everything including ICU

As far as I can tell, any libc locale has a reasonable match in ICU, so
setting LOCALE to either C or a libc locale name should be fine. Some
locales are only valid in ICU, e.g. '@colStrength=primary', or a
language tag representation, so if you do something like:

  create database foo locale 'en_US@colStrenghth=primary'
template template0;

You'll get a decent error like:

  ERROR:  invalid LC_COLLATE locale name: "en_US@colStrenghth=primary"
  HINT:  If the locale name is specific to ICU, use ICU_LOCALE.

Overall, I think it works out nicely. Let me know if there are still
some confusing cases. I tried a few variations and this one seemed the
best, but I may have missed something.

Regards,
Jeff Davis

From c768e040dc92b033e4eb0e69f08b59d8d1ffe1e4 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH v2 1/3] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.

The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.

The user can still directly specify 'en-US-u-va-posix' if desired.

Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com
---
 src/backend/utils/adt/pg_locale.c | 19 +--
 src/bin/initdb/initdb.c   | 17 +
 .../regress/expected/collate.icu.utf8.out |  8 
 src/test/regress/sql/collate.icu.utf8.sql |  4 
 4 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 51df570ce9..58c4c426bc 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
 {
 #ifdef USE_ICU
 	UErrorCode	 status;
-	char		 lang[ULOC_LANG_CAPACITY];
 	char		*langtag;
 	size_t		 buflen = 32;	/* arbitrary starting buffer size */
 	const bool	 strict = true;
 
-	status = U_ZERO_ERROR;
-	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
-	{
-		if (elevel > 0)
-			ereport(elevel,
-	(errmsg("could not get language from locale \"%s\": %s",
-			loc_str, u_errorName(status;
-		return NULL;
-	}
-
-	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
-	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
-		return pstrdup("en-US-u-va-posix");
-
 	/*
 	 * A BCP47 language tag doesn't have a clearly-defined upper limit
 	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str)
 
 	/* check for special language name */
 	if (strcmp(lang, "") == 0 ||
-		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
-		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
 		found = true;
 
 	/* search for matching language within ICU */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..4086834458 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str)
 {
 #ifdef USE_ICU
 	UErrorCode	 status;
-	char		 lang[ULOC_LANG_CAPACITY];
 	char		*langtag;
 	size_t		 buflen = 32;	/* arbitrary starting buffer size */
 	const bool	 strict = true;
 
-	status = U_ZERO_ERROR;
-	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
-	{
-		pg_fatal("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status));
-		return NULL;
-	}
-
-	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
-	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
-		return pstrdup("en-US-u-va-posix");
-
 	/*
 	 * A BCP47 language tag doesn't have a clearly-defined upper limit
 	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str)
 
 	/* check for special language name */
 	if (strcmp(lang, "") == 0 ||
-		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
-		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
 		found = true;
 
 	/* search for matching language within ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..99f12d2e73 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 --

Re: Order changes in PG16 since ICU introduction

2023-04-25 Thread Tom Lane

"Daniel Verite"  writes:
> FTR the full text search parser still uses the libc functions
> is[w]space/alpha/digit...  that depend on lc_ctype, whether the db
> collation provider is ICU or not.

Yeah, those aren't even connected up to the collation-selection
mechanisms; lots of work to do there.  I wonder if they could be
made to use regc_pg_locale.c instead of duplicating logic.

regards, tom lane

Re: Order changes in PG16 since ICU introduction

2023-04-25 Thread Daniel Verite

Jeff Davis wrote:

> > (I'm not sure whether those operations can get redirected to ICU
> > today
> > or whether they still always go to libc, but we'll surely want to fix
> > it eventually if the latter is still true.)
> 
> Those operations do get redirected to ICU today. 

FTR the full text search parser still uses the libc functions
is[w]space/alpha/digit...  that depend on lc_ctype, whether the db
collation provider is ICU or not.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

2023-04-24 Thread Jeff Davis

On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote:
> I think I might like this idea, except for one thing: you're
> imagining
> that the locale doesn't control anything except string comparisons.
> What about to_upper/to_lower, character classifications in regexes,
> etc?

If provider='libc' and LC_CTYPE='C', str_toupper/str_tolower are
handled with asc_tolower/asc_toupper. The regex character
classification is done with pg_char_properties. In these cases neither
ICU nor libc is used; it's just code in postgres.

libc is special in that you can set LC_COLLATE and LC_CTYPE separately,
so that different locales are used for sorting and character
classification. That's potentially useful to set LC_COLLATE to C for
performance reasons, while setting LC_CTYPE to a useful locale. We
don't allow ICU to set collation and ctype separately (it would be
possible to allow it, but I don't think there's a huge demand and it's
arguably inconsistent to set them differently).

> (I'm not sure whether those operations can get redirected to ICU
> today
> or whether they still always go to libc, but we'll surely want to fix
> it eventually if the latter is still true.)

Those operations do get redirected to ICU today. There are extensions
that call locale-sensitive libc functions directly, and obviously those
won't use ICU.

> Aside from the user-surprise issues discussed up to now, pg_dump
> scripts
> emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
> clauses in CREATE DATABASE, and people are going to be very unhappy
> if that means they suddenly get totally different locale semantics
> after restoring into a new DB.

Agreed.

>   I think we need some plan for mapping
> libc-style locale specs into ICU locales so that we can make that
> more nearly transparent.

ICU does a reasonable job mapping libc-like locale names to ICU
locales, e.g. en_US to en-US, etc. The ordering semantics aren't
guaranteed to be the same, of course (because the libc-locales are
platform-dependent), but it's at least conceptually the same locale.

> 
> Maybe this means we are not ready to do ICU-by-default in v16.
> It certainly feels like there might be more here than we want to
> start designing post-feature-freeze.

This thread is already on the Open Items list. As long as it's not too
disruptive to others I'll leave it as-is for now to see how this sorts
out. Right now it's not clear to me how much of this is a v15 issue vs
a v16 issue.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-04-24 Thread Peter Eisentraut


On 22.04.23 01:00, Jeff Davis wrote:

On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:

And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix"


I already expressed, on reflection, that we should probably just not do
that. So I think we're in agreement on this point; patch attached.


This makes sense to me.  This way, if someone specifies 'C' locale 
together with ICU provider they get an error.  They can then choose to 
use the libc provider, to get the performance path, or stick with ICU by 
using the native spelling of the locale.

Re: Order changes in PG16 since ICU introduction

2023-04-24 Thread Peter Eisentraut


On 21.04.23 19:14, Peter Eisentraut wrote:

On 21.04.23 19:09, Sandro Santilli wrote:

On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:

"Regina Obe"  writes:


https://trac.osgeo.org/postgis/ticket/5375


If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.


Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :

   =# select version();
   PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit

   =# show lc_collate;
   C
   =# select '+' < '-';
   f


If the database is created with locale provider ICU, then lc_collate 
does not apply here, so the result might be correct (depending on what 
locale you have set).


The GUC settings lc_collate and lc_ctype are from a time when those 
locale settings were cluster-global.  When we made those locale settings 
per-database (PG 8.4), we kept them as read-only.  As of PG 15, you can 
use ICU as the per-database locale provider, so what is being attempted 
in the above example is already meaningless before PG 16, since you need 
to look into pg_database to find out what is really happening.


I think we should just remove the GUC parameters lc_collate and lc_ctype.

Re: Order changes in PG16 since ICU introduction

2023-04-21 Thread Robert Haas

On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis  wrote:
> Most of the complaints seem to be complaints about v15 as well, and
> while those complaints may be a reason to not make ICU the default,
> they are also an argument that we should continue to learn and try to
> fix those issues because they exist in an already-released version.
> Leaving it the default for now will help us fix those issues rather
> than hide them.
>
> It's still early, so we have plenty of time to revert the initdb
> default if we need to.

That's fair enough, but I really think it's important that some energy
get invested in providing adequate documentation for this stuff. Just
patching the code is not enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:
> And the fact that "C" or "POSIX" gets transformed into
> "en-US-u-va-posix"

I already expressed, on reflection, that we should probably just not do
that. So I think we're in agreement on this point; patch attached.

Regards,
Jeff Davis

From 3d2791af0a236cbc7ce7f29d988e8ac7fd3fd389 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.

The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.

The user can still directly specify 'en-US-u-va-posix' if desired.

Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.ca...@j-davis.com
---
 src/backend/utils/adt/pg_locale.c | 19 +--
 src/bin/initdb/initdb.c   | 17 +
 .../regress/expected/collate.icu.utf8.out |  8 
 src/test/regress/sql/collate.icu.utf8.sql |  4 
 4 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 51df570ce9..58c4c426bc 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
 {
 #ifdef USE_ICU
 	UErrorCode	 status;
-	char		 lang[ULOC_LANG_CAPACITY];
 	char		*langtag;
 	size_t		 buflen = 32;	/* arbitrary starting buffer size */
 	const bool	 strict = true;
 
-	status = U_ZERO_ERROR;
-	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
-	{
-		if (elevel > 0)
-			ereport(elevel,
-	(errmsg("could not get language from locale \"%s\": %s",
-			loc_str, u_errorName(status;
-		return NULL;
-	}
-
-	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
-	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
-		return pstrdup("en-US-u-va-posix");
-
 	/*
 	 * A BCP47 language tag doesn't have a clearly-defined upper limit
 	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str)
 
 	/* check for special language name */
 	if (strcmp(lang, "") == 0 ||
-		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
-		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
 		found = true;
 
 	/* search for matching language within ICU */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..4086834458 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str)
 {
 #ifdef USE_ICU
 	UErrorCode	 status;
-	char		 lang[ULOC_LANG_CAPACITY];
 	char		*langtag;
 	size_t		 buflen = 32;	/* arbitrary starting buffer size */
 	const bool	 strict = true;
 
-	status = U_ZERO_ERROR;
-	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
-	if (U_FAILURE(status))
-	{
-		pg_fatal("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status));
-		return NULL;
-	}
-
-	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
-	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
-		return pstrdup("en-US-u-va-posix");
-
 	/*
 	 * A BCP47 language tag doesn't have a clearly-defined upper limit
 	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str)
 
 	/* check for special language name */
 	if (strcmp(lang, "") == 0 ||
-		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
-		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
 		found = true;
 
 	/* search for matching language within ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..99f12d2e73 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
 SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1034,17 +1035,24 @@ BEGIN
   quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET icu_validation_level;
 RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
 ERROR:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = 'c'); --

RE: Order changes in PG16 since ICU introduction

2023-04-21 Thread Regina Obe

> > My opinion is that the switch to using ICU by default is ill-advised
> > and should be reverted.
> 
> Most of the complaints seem to be complaints about v15 as well, and while
> those complaints may be a reason to not make ICU the default, they are also
> an argument that we should continue to learn and try to fix those issues
> because they exist in an already-released version.
> Leaving it the default for now will help us fix those issues rather than hide
> them.
> 
> It's still early, so we have plenty of time to revert the initdb default if 
> we need
> to.
> 
> Regards,
>   Jeff Davis

I'm fine with that.  Sounds like it wouldn't be too hard to just pull it out at 
the end.

Before this, I didn't even know ICU existed in PG15. My first realization that 
ICU was even a thing was when my PG16 refused to compile without adding my ICU 
path to my pkg-config or putting in --without-icu.

So yah I suspect leaving it in a little bit longer will uncover some more 
issues and won't harm too much.

Thanks,
Regina

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:

> My opinion is that the switch to using ICU by default is ill-advised
> and should be reverted.

Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.

It's still early, so we have plenty of time to revert the initdb
default if we need to.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

> "Jeff" == Jeff Davis  writes:

 >> Is that the right fix, though? (It forces --locale-provider=libc for
 >> the cluster default, which might not be desirable?)

 Jeff> For the "no locale" behavior (memcmp()-based) the provider needs
 Jeff> to be libc. Do you see an alternative?

Can lc_collate_is_c() be taught to check whether an ICU locale is using
POSIX collation?

There's now another bug in that --no-locale no longer does the same
thing as --locale=C (which is its long-established documented behavior).
How should these various options interact? This all seems not well
thought out from a usability perspective, and I think a proper fix
should involve a bit more serious consideration.

-- 
Andrew.

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 22:08 +0100, Andrew Gierth wrote:
> > > > > 
> Is that the right fix, though? (It forces --locale-provider=libc for
> the
> cluster default, which might not be desirable?)

For the "no locale" behavior (memcmp()-based) the provider needs to be
libc. Do you see an alternative?

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

> "Jeff" == Jeff Davis  writes:

 >> Also, somewhere along the line someone broke initdb --no-locale,
 >> which should result in C locale being the default everywhere, but
 >> when I just tested it it picked 'en' for an ICU locale, which is not
 >> the right thing.

 Jeff> Fixed, thank you.

Is that the right fix, though? (It forces --locale-provider=libc for the
cluster default, which might not be desirable?)

-- 
Andrew.

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote:
> Maybe this means we are not ready to do ICU-by-default in v16.
> It certainly feels like there might be more here than we want to
> start designing post-feature-freeze.

I don't see how punting to the next release helps. If the CREATE
DATABASE syntax (and similar issues for createdb and initdb) in v15 is
just too confusing, and we can't find a remedy for v16, then we
probably won't find a remedy for v17 either.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote:
> > > > > 
> Also, somewhere along the line someone broke initdb --no-locale,
> which
> should result in C locale being the default everywhere, but when I
> just
> tested it it picked 'en' for an ICU locale, which is not the right
> thing.

Fixed, thank you.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-04-21 Thread Robert Haas

On Fri, Apr 21, 2023 at 3:25 PM Jeff Davis  wrote:
> I am also having second thoughts about accepting "C" or "POSIX" as an
> ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
> terribly useful (why not just use memcmp()?), it's not fast in my
> measurements (en-US is faster), so maybe it's better to just throw an
> error and tell the user to use C (or provider=none as I suggest
> above)?

I mean, to renew a complaint I've made previously, how the heck is
anyone supposed to understand what's going on here?

We have no meaningful documentation of how to select an ICU locale
that works for you. We have a couple of examples and a suggestion that
you should use BCP 47. But when I asked before for documentation
references, the ones you provided were not clear, basically
incomprehensible. In follow-up discussion, you admitted you'd had to
consult the source code to figure certain things out.

And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix" is also completely documented. That string appears
twice in the code, but zero times in the documentation. There's code
to do it, but users shouldn't have to read code, and it wouldn't help
much if they did, because the code comments don't really explain the
rationale behind this choice either.

I find the fact that people are having trouble here completely
predictable. Of course if people ask for "C" and the system tells them
that it's using "en-US-u-va-posix" instead they're going to be
confused and ask questions, exactly as is happening here. glibc
collations aren't particularly well-documented either, but people have
some experience with, and they can get a list of values that have a
chance of working from /usr/share/locale, and they know what "C"
means. Nobody knows what "en-US-u-va-posix" is. It's not even
Googleable, really, whereas "C locale" is.

My opinion is that the switch to using ICU by default is ill-advised
and should be reverted. The compatibility break isn't worth whatever
advantages ICU may have, the documentation to allow people to
transition to ICU with reasonable effort doesn't exist, and the fact
that within weeks of feature freeze people who know a lot about
PostgreSQL are struggling to get the behavior they want is a really
bad sign.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 13:28 -0400, Tom Lane wrote:
> I am wondering however whether this doesn't mean that all our
> carefully
> coded fast paths for C locale just went down the drain.

The code still exists. You can test it by using the built-in collation
"C" which is correctly specified with collprovider=libc and
collcollate=C.

For my test dataset, ICU 72, glibc 2.35:

  -- ~07s
  explain analyze select t from a order by t collate "C";

  -- ~15s
  explain analyze select t from a order by t collate "en-US-x-icu";

  -- ~21s
  explain analyze select t from a order by t collate "en-US-u-va-posix-
x-icu";

  -- ~34s
  explain analyze select t from a order by t collate "en_US";

I believe the confusion in this thread comes from:

* The syntax of CREATE DATABASE (the same as v15 but still confusing)
* The fact that you need provider=libc to get memcmp() behavior (same
as v15 but still confusing)

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 21:14 +0200, Sandro Santilli wrote:
> And then runs:
> 
>   createdb --encoding=UTF-8 --template=template0 --lc-collate=C 
> 
> Should we tweak anything else to make the results predictable ?

You can specify --locale-provider=libc

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

Jeff Davis  writes:
> I have a couple ideas:

> 1. Introduce a "none" provider to separate the concept of C/POSIX
> locales from the libc provider. It's not really using a provider
> anyway, it's just using memcmp(), and I think it causes confusion to
> combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
> "LOCALE_PROVIDER=libc LOCALE='C'".

I think I might like this idea, except for one thing: you're imagining
that the locale doesn't control anything except string comparisons.
What about to_upper/to_lower, character classifications in regexes, etc?
(I'm not sure whether those operations can get redirected to ICU today
or whether they still always go to libc, but we'll surely want to fix
it eventually if the latter is still true.)

In any case, that seems somewhat orthogonal to what we're on about here,
which is making the behavior of CREATE DATABASE less surprising and more
backwards-compatible.  I'm not sure that provider=none can help with that.
Aside from the user-surprise issues discussed up to now, pg_dump scripts
emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
clauses in CREATE DATABASE, and people are going to be very unhappy
if that means they suddenly get totally different locale semantics
after restoring into a new DB.  I think we need some plan for mapping
libc-style locale specs into ICU locales so that we can make that
more nearly transparent.

> 2. Change the CREATE DATABASE syntax to catch these errors better at
> the possible expense of backwards compatibility.

That is the exact opposite of what I think we need.  Backwards
compatibility isn't optional.

Maybe this means we are not ready to do ICU-by-default in v16.
It certainly feels like there might be more here than we want to
start designing post-feature-freeze.

regards, tom lane

Re: Order changes in PG16 since ICU introduction

On Fri, 2023-04-21 at 14:23 -0400, Tom Lane wrote:
> postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8'
> LOCALE = 'C';

...

>  test1 | postgres | UTF8 | icu | C  |
> C  | en-US  |   | 
> (4 rows)
> 
> Looks like the "pick en-US even when told not to" problem exists here
> too.

Both provider (ICU) and the icu locale (en-US) are inherited from
template0. The LOCALE parameter to CREATE DATABASE doesn't affect
either of those things, because there's a separate parameter
ICU_LOCALE.

This happens the same way in v15, and although it matches the
documentation technically, it is not a great user experience.

I have a couple ideas:

1. Introduce a "none" provider to separate the concept of C/POSIX
locales from the libc provider. It's not really using a provider
anyway, it's just using memcmp(), and I think it causes confusion to
combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
"LOCALE_PROVIDER=libc LOCALE='C'".

2. Change the CREATE DATABASE syntax to catch these errors better at
the possible expense of backwards compatibility.

I am also having second thoughts about accepting "C" or "POSIX" as an
ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
terribly useful (why not just use memcmp()?), it's not fast in my
measurements (en-US is faster), so maybe it's better to just throw an
error and tell the user to use C (or provider=none as I suggest
above)? 

Obviously the user could manually type "en-US-u-va-posix" if that's the
locale they want. Throwing an error would be a backwards-compatibility
issue, but in v15 an ICU locale of "C" just gives the root locale
anyway, which is probably not what they want.

Regards,
Jeff Davis

Re: Order changes in PG16 since ICU introduction

2023-04-21 Thread Sandro Santilli

On Fri, Apr 21, 2023 at 10:27:49AM -0700, Jeff Davis wrote:
> On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote:
> >   =# select version();
> >   PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> > 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
> >   =# show lc_collate;
> >   C
> >   =# select '+' < '-';
> >   f
> 
> What is the result of:
> 
>   select datlocprovider, datcollate, daticulocale
> from pg_database where datname=current_database();

datlocprovider | i
datcollate | C
daticulocale   | en-US

--strk;

Re: Order changes in PG16 since ICU introduction

> "Tom" == Tom Lane  writes:

 >> Also, somewhere along the line someone broke initdb --no-locale,
 >> which should result in C locale being the default everywhere, but
 >> when I just tested it it picked 'en' for an ICU locale, which is not
 >> the right thing.

 Tom> Confirmed:

 Tom> $ LANG=en_US.utf8 initdb --no-locale
 Tom> The files belonging to this database system will be owned by user 
"postgres".
 Tom> This user must also own the server process.

 Tom> Using default ICU locale "en_US".
 Tom> Using language tag "en-US" for ICU locale "en_US".
 Tom> The database cluster will be initialized with this locale configuration:
 Tom>   provider:icu
 Tom>   ICU locale:  en-US
 Tom>   LC_COLLATE:  C
 Tom>   LC_CTYPE:C
 Tom>   ...

 Tom> That needs to be fixed: --no-locale should prevent any
 Tom> consideration of initdb's LANG/LC_foo environment.

Would it also not make sense to also take into account any --locale and
--lc-* options before choosing an ICU default locale? Right now if you
do, say, initdb --locale=fr_FR you get an ICU locale based on the
environment but lc_* settings based on the option, which seems maximally
confusing.

Also, what happens now to lc_collate_is_c() when the provider is ICU? Am
I missing something, or is it never true now, even if you specified C /
POSIX / en-US-u-va-posix as the ICU locale? This seems like it could be
an important pessimization.

Also also, we now have the problem that it is much harder to create a
'C' collation database within an existing cluster (e.g. for testing)
without knowing whether the default provider is ICU. In the past one
would have done:

CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';

but now that creates a database that uses the same ICU locale as
template0 by default. If instead one tries:

CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' 
ICU_LOCALE='C';

then one gets an error if the default locale provider is _not_ ICU. The
only option now seems to be:

CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' 
LOCALE_PROVIDER = 'libc';

which of course doesn't work in older pg versions.

-- 
Andrew.

Re: Order changes in PG16 since ICU introduction

2023-04-21 Thread Sandro Santilli

On Fri, Apr 21, 2023 at 07:14:13PM +0200, Peter Eisentraut wrote:
> On 21.04.23 19:09, Sandro Santilli wrote:
> > On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
> > > "Regina Obe"  writes:
> > > 
> > > > https://trac.osgeo.org/postgis/ticket/5375
> > > 
> > > If they actually are using locale C, I would say this is a bug.
> > > That should designate memcmp sorting and nothing else.
> > 
> > Sounds like a bug to me. This is happening with a PostgreSQL cluster
> > created and served by a build of commit c04c6c5d6f :
> > 
> >=# select version();
> >PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
> > 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
> >=# show lc_collate;
> >C
> >=# select '+' < '-';
> >f
> 
> If the database is created with locale provider ICU, then lc_collate does
> not apply here, so the result might be correct (depending on what locale you
> have set).

The database is created by a perl script which starts like this:

  $ENV{"LC_ALL"} = "C";
  $ENV{"LANG"} = "C";

And then runs:

  createdb --encoding=UTF-8 --template=template0 --lc-collate=C 

Should we tweak anything else to make the results predictable ?

--strk;

Re: Order changes in PG16 since ICU introduction

"Regina Obe"  writes:
> CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
> Doesn't seem to work at least not under mingw64 anyway.

Hmm, doesn't work for me either:

$ LANG=en_US.utf8 initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
  provider:icu
  ICU locale:  en-US
  LC_COLLATE:  en_US.utf8
  LC_CTYPE:en_US.utf8
  LC_MESSAGES: en_US.utf8
  LC_MONETARY: en_US.utf8
  LC_NUMERIC:  en_US.utf8
  LC_TIME: en_US.utf8
  ...
$ psql postgres
psql (16devel)
Type "help" for help.

postgres=# SELECT '+'  <  '-'  ;
 ?column? 
--
 f
(1 row)

(as expected, so far)

postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 
'C';
CREATE DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "postgres".
test1=# SELECT '+'  <  '-'  ;
 ?column? 
--
 f
(1 row)

(wrong!)

test1=# \l
  List of databases
   Name|  Owner   | Encoding | Locale Provider |  Collate   |   Ctype| 
ICU Locale | ICU Rules |   Access privileges   
---+--+--+-++++---+---
 postgres  | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | 
en-US  |   | 
 template0 | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | 
en-US  |   | =c/postgres  +
   |  |  | |||  
  |   | postgres=CTc/postgres
 template1 | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | 
en-US  |   | =c/postgres  +
   |  |  | |||  
  |   | postgres=CTc/postgres
 test1 | postgres | UTF8 | icu | C  | C  | 
en-US  |   | 
(4 rows)

Looks like the "pick en-US even when told not to" problem exists here too.

regards, tom lane

RE: Order changes in PG16 since ICU introduction

2023-04-21 Thread Regina Obe

> Yeah.  My recommendation is just LOCALE:
> 
> regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING =
> 'UTF8' LOCALE = 'C'; CREATE DATABASE regression=# CREATE DATABASE test2
> TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C';
> NOTICE:  using standard form "en-US-u-va-posix" for locale "C"
> CREATE DATABASE
> 
> I think it's probably intentional that ICU_LOCALE is stricter about being
given
> a real ICU locale name, but I didn't write any of that code.
> 
>   regards, tom lane

CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';

Doesn't seem to work at least not under mingw64 anyway.

SELECT '+'  <  '-'  ;

Returns false

Re: Order changes in PG16 since ICU introduction

Andrew Gierth  writes:
> "Peter" == Peter Eisentraut  writes:
>  Peter> If the database is created with locale provider ICU, then
>  Peter> lc_collate does not apply here,

> Having lc_collate return a value which is silently being ignored seems
> to me rather hugely confusing.

It's not *completely* ignored --- there are bits of code that are not
yet ICU-ified and will still use the libc facilities.  So we can't
get rid of those options yet, even in an ICU-based database.

> Also, somewhere along the line someone broke initdb --no-locale, which
> should result in C locale being the default everywhere, but when I just
> tested it it picked 'en' for an ICU locale, which is not the right
> thing.

Confirmed:

$ LANG=en_US.utf8 initdb --no-locale
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
  provider:icu
  ICU locale:  en-US
  LC_COLLATE:  C
  LC_CTYPE:C
  ...

That needs to be fixed: --no-locale should prevent any consideration
of initdb's LANG/LC_foo environment.

regards, tom lane

Re: Order changes in PG16 since ICU introduction

> "Peter" == Peter Eisentraut  writes:

 Peter> If the database is created with locale provider ICU, then
 Peter> lc_collate does not apply here,

Having lc_collate return a value which is silently being ignored seems
to me rather hugely confusing.

Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.

-- 
Andrew (irc:RhodiumToad)

Re: Order changes in PG16 since ICU introduction