Re: pg_collation.collversion for C.UTF-8

2023-06-22 Thread Thomas Munro
On Tue, Jun 20, 2023 at 6:48 AM Jeff Davis wrote: > On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote: > > > Would it be correct to interpret LC_COLLATE=C.UTF-8 as > > > LC_COLLATE=C, > > > but leave LC_CTYPE=C.UTF-8 as-is? > > > > Yes. The basic idea, at least for these two OSes, is that

Re: pg_collation.collversion for C.UTF-8

2023-06-21 Thread Daniel Verite
Thomas Munro wrote: > What could we do that would be helpful here, without affecting users > of the "true" C.UTF-8 for the rest of time? This is a Debian (+ > downstream distro) only problem as far as we know so far, and only > for Debian 11 and older. It seems to include RedHat-based

Re: pg_collation.collversion for C.UTF-8

2023-06-19 Thread Jeff Davis
On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote: > > > Would it be correct to interpret LC_COLLATE=C.UTF-8 as > > LC_COLLATE=C, > > but leave LC_CTYPE=C.UTF-8 as-is? > > Yes.  The basic idea, at least for these two OSes, is that every > category behaves as if set to C, except LC_CTYPE. If

Re: pg_collation.collversion for C.UTF-8

2023-06-16 Thread Thomas Munro
On Sat, Jun 17, 2023 at 10:03 AM Jeff Davis wrote: > On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote: > > Hmm, OK let's explore that. What could we do that would be helpful > > here, without affecting users of the "true" C.UTF-8 for the rest of > > time? > > Where is the "true" C.UTF-8

Re: pg_collation.collversion for C.UTF-8

2023-06-16 Thread Jeff Davis
On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote: > Hmm, OK let's explore that.  What could we do that would be helpful > here, without affecting users of the "true" C.UTF-8 for the rest of > time? Where is the "true" C.UTF-8 defined? I assume you mean that the collation order can't

Re: pg_collation.collversion for C.UTF-8

2023-06-15 Thread Thomas Munro
On Sun, Apr 23, 2023 at 5:22 AM Daniel Verite wrote: > I understand that my proposal to version C.* like any other collation > might be erring on the side of caution, but ignoring these collation > changes on at least one major OS does not feel right either. > Maybe we should consider doing

Re: pg_collation.collversion for C.UTF-8

2023-06-07 Thread Jeff Davis
On Wed, 2023-06-07 at 23:28 +0200, Peter Eisentraut wrote: > On 06.06.23 21:23, Jeff Davis wrote: > > What about ICU? How should provider=icu locale=C.UTF-8 behave? We > > could: > > It should be an error. > > > a. Just pass it to the provider and see what happens (older > > versions of > > ICU

Re: pg_collation.collversion for C.UTF-8

2023-06-07 Thread Peter Eisentraut
On 06.06.23 21:23, Jeff Davis wrote: What about ICU? How should provider=icu locale=C.UTF-8 behave? We could: It should be an error. a. Just pass it to the provider and see what happens (older versions of ICU would interpret it as en-US-u-va-posix; newer versions would give the root locale).

Re: pg_collation.collversion for C.UTF-8

2023-06-07 Thread Daniel Verite
I wrote: > Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we > only match ASCII characters 0-9, or 10 codepoints. With > "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the > digit characters in all languages, plus a bunch of variants for > mathematical

Re: pg_collation.collversion for C.UTF-8

2023-06-07 Thread Daniel Verite
Jeff Davis wrote: > What about ICU? How should provider=icu locale=C.UTF-8 behave? We > could: > > a. Just pass it to the provider and see what happens (older versions of > ICU would interpret it as en-US-u-va-posix; newer versions would give > the root locale). > > b. Consistently

Re: pg_collation.collversion for C.UTF-8

2023-06-06 Thread Joe Conway
On 6/6/23 15:23, Jeff Davis wrote: On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote: But in the meantime, personally I don't quite see why Postgres should start forcing C.UTF-8 to sort differently in the database than in the OS. I can see both points of view. It could be surprising to

Re: pg_collation.collversion for C.UTF-8

2023-06-06 Thread Jeff Davis
On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote: > But in the meantime, personally I don't quite see why Postgres should > start forcing C.UTF-8 to sort differently in the database than in the > OS. I can see both points of view. It could be surprising to users if C.UTF-8 does not sort

Re: pg_collation.collversion for C.UTF-8

2023-06-05 Thread Daniel Verite
Jeff Davis wrote: > > For libc: this change may affect any user who happened to have > > LANG=C.UTF-8 in their environment at initdb time, which is probably a > > lot of users, and some buildfarm members. However, the average risk > > seems to be much lower, because we've gone a long

Re: pg_collation.collversion for C.UTF-8

2023-06-05 Thread Jeff Davis
On Fri, 2023-05-26 at 10:43 -0700, Jeff Davis wrote: > We still need to consider backwards compatibility. If someone has a > collation with locale name C.UTF-8 in an earlier version, any change > to > the interpretation of that locale name after an upgrade carries a > corruption risk. The risks

Re: pg_collation.collversion for C.UTF-8

2023-05-26 Thread Jeff Davis
On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote: > Jeff Davis writes: > > What should we do with locales like C.UTF-8 in both libc and ICU? > > I vote for passing those to the existing C-specific code paths, Great, this would be a big step toward solving the ICU usability issues in this

Re: pg_collation.collversion for C.UTF-8

2023-05-25 Thread Tom Lane
Jeff Davis writes: > What should we do with locales like C.UTF-8 in both libc and ICU? I vote for passing those to the existing C-specific code paths, wherever we have any (not sure that we do for functionality). The semantics are quite well-defined and I can see no good coming of allowing

Re: pg_collation.collversion for C.UTF-8

2023-05-25 Thread Jeff Davis
On Wed, 2023-04-19 at 14:07 +1200, Thomas Munro wrote: > That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied > by > the glibc project) isn't supposed to be versioned, but it's extremely > unfortunate that a bunch of OSes (Debian and maybe more) have been > sorting text in some

Re: pg_collation.collversion for C.UTF-8

2023-04-22 Thread Daniel Verite
Thomas Munro wrote: > It looks like for technical reasons > inside glibc, that couldn't be done before 2.35: > > https://sourceware.org/bugzilla/show_bug.cgi?id=17318 > > That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied > by the glibc project) isn't supposed to be

Re: pg_collation.collversion for C.UTF-8

2023-04-18 Thread Thomas Munro
On Wed, Apr 19, 2023 at 1:30 PM Jeff Davis wrote: > On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote: > > Many OSes have a locale with this name. I don't know this history, > > who did it first etc, but now I am wondering if they all took the > > "obvious" interpretation, that it should be

Re: pg_collation.collversion for C.UTF-8

2023-04-18 Thread Jeff Davis
On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote: > Many OSes have a locale with this name.  I don't know this history, > who did it first etc, but now I am wondering if they all took the > "obvious" interpretation, that it should be code-point based, > extrapolating from "C" (really memcmp
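For readers following the "code-point based ... (really memcmp" remark: a minimal sketch of why a plain byte comparison of valid UTF-8 already yields code point order. This is not PostgreSQL or glibc code. The helper names (utf8_next, cmp_codepoints, cmp_bytes, check_agreement) are made up for illustration, and the decoder assumes well-formed UTF-8 input without validating it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one code point from well-formed UTF-8 and
 * advance *s past it. No validation is done; invalid input is out of scope. */
static uint32_t utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }   /* 1-byte (ASCII) */
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }   /* 2-byte lead */
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }   /* 3-byte lead */
    else                  { cp = p[0] & 0x07; extra = 3; }   /* 4-byte lead */
    p++;
    while (extra-- > 0)
        cp = (cp << 6) | (*p++ & 0x3F);                      /* continuation bytes */
    *s = p;
    return cp;
}

/* Compare two strings by decoded code point values; returns -1, 0 or 1. */
static int cmp_codepoints(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *) a;
    const unsigned char *pb = (const unsigned char *) b;

    while (*pa && *pb)
    {
        uint32_t ca = utf8_next(&pa);
        uint32_t cb = utf8_next(&pb);

        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    if (*pa) return 1;      /* a is longer, so it sorts after b */
    if (*pb) return -1;     /* b is longer, so it sorts after a */
    return 0;
}

/* Compare raw bytes: memcmp over the common prefix, then shorter-first.
 * The result is normalized to -1, 0 or 1. */
static int cmp_bytes(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    int r = memcmp(a, b, la < lb ? la : lb);

    if (r == 0)
        r = (la > lb) - (la < lb);
    return (r > 0) - (r < 0);
}

/* For well-formed UTF-8 the two orderings are expected to agree. */
static void check_agreement(const char *a, const char *b)
{
    assert(cmp_bytes(a, b) == cmp_codepoints(a, b));
}
```

Calling check_agreement("z", "\xC3\xA9") (U+007A against U+00E9) or check_agreement("\xE2\x82\xAC", "\xF0\x9F\x98\x80") (U+20AC against U+1F600) exercises the 1-, 2-, 3- and 4-byte cases. Whether a given OS's C.UTF-8 locale actually sorts this way is exactly what the thread is debating; the sketch only shows what the "obvious" code-point interpretation means.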

Re: pg_collation.collversion for C.UTF-8

2023-04-18 Thread Thomas Munro
On Wed, Apr 19, 2023 at 12:36 AM Daniel Verite wrote: > This seems to be based on the idea that C.* collations provide an > immutable sort like "C", but it appears that it's not the case. Hmm. It seems I added that exemption initially for FreeBSD only in ca051d8b101, and then merged the cases

pg_collation.collversion for C.UTF-8

2023-04-18 Thread Daniel Verite
Hi, get_collation_actual_version() in pg_locale.c currently excludes C.UTF-8 (and more generally C.*) from versioning, which leaves pg_collation.collversion empty for these collations. char * get_collation_actual_version(char collprovider, const char *collcollate) { if