subject:"Re\: Collation version tracking for macOS"

Re: Collation version tracking for macOS

2024-02-13 Thread Thomas Munro

On Tue, Feb 13, 2024 at 9:25 AM Jeff Davis  wrote:
> On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote:
> > "icu_multilib must be loaded via shared_preload_libraries.
> > icu_multilib ignores any ICU library with a major version greater
> > than
> > that with which PostgreSQL was built."
> >
> > It's not clear from reading this whether the second sentence here is
> > a
> > regrettable implementation restriction or design behavior. If it's
> > design behavior, what's the point of it?
>
> That restriction came from Thomas's (uncommitted) work on the same
> problem. I believe the reasoning was that we don't know whether future
> versions of ICU might break something that we're doing, though perhaps
> there's a better way.

Right, to spell that out more fully:  We compile and link against one
particular ICU library that is present at compile time, and there is a
place in that multi-lib patch that assigns the function pointers from
that version to variables of the function pointer type that we expect.
Compilation would fail if ICU ever changed relevant function
prototypes in a future release, and then we'd have to come up with
some trampoline/wrapper scheme to wallpaper over differences.  That's
why I think it's safe to use dlsym() to access function pointers for
versions up to and including the one whose headers we were compiled
against, but not later ones which we haven't tested in that way.

Sadly I won't be able to work on multi-lib ICU support again in this
cycle.  I think we managed to prove that dlopen works for this, and
learn some really interesting stuff about Unicode and ICU evolution,
but we still have to come up with the right model, catalogues and DDL
etc, for a nice user experience.  What I was most recently
experimenting with based on earlier discussions was the idea of
declaring separate providers: icu72 and icu68 could both exist and you
could create extra indexes and then drop the originals as a
no-downtime upgrade path.  I have a pet theory that you could usefully
support multi-version libc locales too if you're prepared to make
certain assumptions (short version: take the collation definition
files from any older version of your OS, compile with newer version's
localedef, give it a name like "en_US@ubuntu18", and assume/pray they
didn't change stuff that wasn't expressed in the definition file), so
I was working on a generalisation slightly wider than just
multi-version ICU.

Re: Collation version tracking for macOS

2024-02-12 Thread Robert Haas

On Tue, Feb 13, 2024 at 1:55 AM Jeff Davis  wrote:
> Postgres can and does latch on to the version of ICU it was compiled
> against. It's a normal shared library dependency.
>
> The problem is that databases -- and the file structures -- outlive a
> particular version of Postgres. So if Postgres 16 is compiled against
> ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade
> from 16 to 17? Postgres 17 will try to access the old file structures
> using ICU Y, and they'll be corrupt.
>
> What we want is the file structures that depend on ICU X to continue to
> find ICU X even after you upgrade to Postgres 17, yet allow new
> structures to be created using ICU Y. In other words, "multi-lib",
> meaning that the same Postgres binary is linking to multiple versions
> of ICU and the different versions for different structures. That would
> allow users to recreate one index at a time to use ICU Y, until nothing
> depends on ICU X any longer.

Ah, I see. At least, I think I do. I think some of this material could
be very usefully included into the first section of the doc you're
trying to write. What you say here makes it a lot easier to grasp the
motivation and use case for this code, at least for me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2024-02-12 Thread Jeff Davis

On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote:
> 1. Here's what we think your OS package manager is probably going to
> do.
> 2. That's going to interact with PostgreSQL in this way that I will
> now describe.
> 3. See, that sucks, because of the stuff I said above about needing
> stable collations!
> 4. But if you installed this module instead, then you could prevent
> the things I said under #2 from happening.
> 5. Instead, you'd get this other behavior, which would make you
> happy.

I like that framing, thank you. I'll try to come up with something
there.

> I feel like I can almost piece together in my head how this is
> supposed to work -- I think it's like "we expect the OS package
> manager to drop all the ICU versions in the same directory via side
> by
> side installs, and that works well for other programs because ... for
> some mysterious reason they can latch onto the specific version they
> were linked against ... but we can't or don't do that because ... I
> guess we're dumber than those other pieces of software or
> something ... 

Postgres can and does latch on to the version of ICU it was compiled
against. It's a normal shared library dependency.

The problem is that databases -- and the file structures -- outlive a
particular version of Postgres. So if Postgres 16 is compiled against
ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade
from 16 to 17? Postgres 17 will try to access the old file structures
using ICU Y, and they'll be corrupt.

What we want is the file structures that depend on ICU X to continue to
find ICU X even after you upgrade to Postgres 17, yet allow new
structures to be created using ICU Y. In other words, "multi-lib",
meaning that the same Postgres binary is linking to multiple versions
of ICU and the different versions for different structures. That would
allow users to recreate one index at a time to use ICU Y, until nothing
depends on ICU X any longer.

I should say this is not an easy process even if something like
icu_multilib is available. We don't have all of the information needed
in the catalog to track which structures depend on which versions of a
collation library, collation library version is itself not easy to
define, and it still involves rebuilding (or at least re-validating) a
lot of structures. This is a "make hard things possible" tool, and I
suspect only a handful of users would use it successfully to migrate to
new ICU versions.

More simply, some users might just want to lock down the version of ICU
to X, and just use that forever until they have a reason to change it.
icu_multilib can also facilitate that, though it's still not trivial.

> "icu_multilib must be loaded via shared_preload_libraries.
> icu_multilib ignores any ICU library with a major version greater
> than
> that with which PostgreSQL was built."
> 
> It's not clear from reading this whether the second sentence here is
> a
> regrettable implementation restriction or design behavior. If it's
> design behavior, what's the point of it?

That restriction came from Thomas's (uncommitted) work on the same
problem. I believe the reasoning was that we don't know whether future
versions of ICU might break something that we're doing, though perhaps
there's a better way.

Regards,
Jeff Davis

Re: Collation version tracking for macOS

2024-02-11 Thread Robert Haas

On Sun, Feb 4, 2024 at 10:42 PM Jeff Davis  wrote:
> I'm hesitant to put much more work into it (e.g. new patches, etc.)
> without more feedback. Your opinion would certainly be valuable -- for
> instance, when reading the docs, can you imagine yourself actually
> using this if you ran into a collation versioning/migration problem?

I'm having some difficulty understanding what the docs are trying to
tell me. I think there are some issues with ordering and pacing.

"The icu_multilib module provides control over the version (or
versions) of the ICU provider library used by PostgreSQL, which can be
different from the version of ICU with which it was built. Collations
are a product of natural language, and natural language evolves over
time; but PostgreSQL depends on stable ordering for structures such as
indexes. Newer versions of ICU update the provided collators to adapt
to changes in natural language, so it's important to control when and
how those new versions of ICU are used to prevent problems such as
index corruption."

Check. So far, so good.

"This module assumes that the necessary versions of ICU are already
available, such as through the operating system's package manager; and
already properly installed in a single location accessible to
PostgreSQL. The configration variable icu_multilib.library_path should
be set to the location where these ICU library versions are
installed."

Here I feel we've skipped a few steps. I suggest postponing all
discussion of specific GUCs to a later point -- specifically the
configuration parameters section, which I think should actually be
F.19.1, with the use cases following that rather than preceding it. In
this introductory section, I suggest elaborating a bit more on what
problem we're trying to solve at a conceptual level. It feels like
we've gone straight from the very general issue (collation definitions
need to be stable but language isn't) to very specific (here's a GUC
that you can set to a pathname). I feel like the need for this module
should be more specifically motivated. Maybe something like:

1. Here's what we think your OS package manager is probably going to do.
2. That's going to interact with PostgreSQL in this way that I will
now describe.
3. See, that sucks, because of the stuff I said above about needing
stable collations!
4. But if you installed this module instead, then you could prevent
the things I said under #2 from happening.
5. Instead, you'd get this other behavior, which would make you happy.

I feel like I can almost piece together in my head how this is
supposed to work -- I think it's like "we expect the OS package
manager to drop all the ICU versions in the same directory via side by
side installs, and that works well for other programs because ... for
some mysterious reason they can latch onto the specific version they
were linked against ... but we can't or don't do that because ... I
guess we're dumber than those other pieces of software or
something ... so this module lets you ask for more sensible
behavior." But I think that could be spelled out a bit more clearly
and directly than this document seems to me to do.

I also wonder if we should be explaining why we don't get this right
out of the box. Like, if the normal behavior categorically sucks, why
do you have to install icu_multilib to get something else? Why not
make the multilib treatment the default? And if the normal behavior is
better for some cases and the icu_multilib behavior is better for
other cases, then maybe we ought to explain which one to use in which
scenario.

"icu_multilib must be loaded via shared_preload_libraries.
icu_multilib ignores any ICU library with a major version greater than
that with which PostgreSQL was built."

It's not clear from reading this whether the second sentence here is a
regrettable implementation restriction or design behavior. If it's
design behavior, what's the point of it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2024-02-04 Thread Jeff Davis

On Thu, 2024-02-01 at 15:58 -0500, Robert Haas wrote:
> Not that I'm the most qualified person to have an opinion on this
> topic, but did you intend to attach this stuff to this email, or is
> it
> somewhere else?

The previous patch is here:

https://www.postgresql.org/message-id/6f4a8c01a5cb1edf3a07d204c371fbddaef252f9.camel%40j-davis.com

And I attached the rendered HTML doc page, which conveniently renders
in the archives (thanks to web team -- I didn't know if that would
actually work until I tried it):

https://www.postgresql.org/message-id/attachment/142818/icu-multilib.html

For anyone interested in this work, the docs are the best place to
start.

I'm hesitant to put much more work into it (e.g. new patches, etc.)
without more feedback. Your opinion would certainly be valuable -- for
instance, when reading the docs, can you imagine yourself actually
using this if you ran into a collation versioning/migration problem?

Regards,
Jeff Davis

Re: Collation version tracking for macOS

2024-02-01 Thread Robert Haas

On Sun, Jan 21, 2024 at 1:58 PM Jeff Davis  wrote:
> I rendered the docs I wrote as an HTML page and attached it to this
> thread, to make it easier for others to read and comment. It's
> basically a tool for experts who are willing to devote effort to
> managing their collations and ICU libraries. Is that what we want?
>
> At an implementation level, did I get the extension APIs right? I
> considered making the API simpler, but that would require the extension
> to do quite a bit more work (including a lot of redundant work) to use
> ICU properly.

Not that I'm the most qualified person to have an opinion on this
topic, but did you intend to attach this stuff to this email, or is it
somewhere else?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2024-02-01 Thread vignesh C

On Mon, 22 Jan 2024 at 00:28, Jeff Davis  wrote:
>
> On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote:
> > This thread has been idle for a year now, It has stalled after a lot
> > of discussion.
> > @Jeff Davis: Do you want to try to restart the discussion by posting
> > an updated version and see what happens?
>
> Thank you for following up. Yes, I'd like to find a path forward here,
> but I need some validation from others on my approach.

Let's start by posting a rebased version to fix the CFBot patch apply
issue as in [1]:

=== Applying patches on top of PostgreSQL commit ID
402388946fb3ac54f0fd5944d7e177ef7737eab2 ===
=== applying patch
./v8-0001-Support-multiple-ICU-collation-provider-libraries.patch
patching file src/backend/commands/collationcmds.c
Hunk #1 FAILED at 566.

1 out of 4 hunks FAILED -- saving rejects to file
src/backend/commands/collationcmds.c.rej
patching file src/backend/utils/adt/formatting.c
Hunk #1 succeeded at 1575 (offset 9 lines).
Hunk #2 succeeded at 1587 (offset 9 lines).
Hunk #3 succeeded at 1605 (offset 9 lines).
Hunk #4 succeeded at 1700 (offset 3 lines).
Hunk #5 succeeded at 1819 (offset -1 lines).
Hunk #6 succeeded at 1939 (offset -5 lines).
patching file src/backend/utils/adt/pg_locale.c
Hunk #1 FAILED at 70.
...
Hunk #31 FAILED at 2886.
Hunk #32 FAILED at 2902.
22 out of 32 hunks FAILED -- saving rejects to file
src/backend/utils/adt/pg_locale.c.rej

[1] - http://cfbot.cputube.org/patch_46_3956.log

Regards,
Vignesh

Re: Collation version tracking for macOS

2024-01-21 Thread Jeff Davis

On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote:
> This thread has been idle for a year now, It has stalled after a lot
> of discussion.
> @Jeff Davis: Do you want to try to restart the discussion by posting
> an updated version and see what happens?

Thank you for following up. Yes, I'd like to find a path forward here,
but I need some validation from others on my approach.

I rendered the docs I wrote as an HTML page and attached it to this
thread, to make it easier for others to read and comment. It's
basically a tool for experts who are willing to devote effort to
managing their collations and ICU libraries. Is that what we want?

At an implementation level, did I get the extension APIs right? I
considered making the API simpler, but that would require the extension
to do quite a bit more work (including a lot of redundant work) to use
ICU properly.

Regards,
Jeff Davis

Re: Collation version tracking for macOS

2024-01-19 Thread vignesh C

On Sat, 21 Jan 2023 at 02:24, Jeff Davis  wrote:
>
> On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote:
> > Attached are a new set of patches, including a major enhancement: the
> > icu_multilib contrib module.
>
> Attached rebased v8.
>
> [ It looks like my email client truncated the last email somehow, in
> case someone was wondering why it just stopped. ]
>
> The big change is the introduction of the icu_multilib contrib module
> which provides a lot of the functionality requested in this thread:
>
>* icu version stability, which allows you to "lock down" ICU to a
> specific major and minor version (or major version only)
>* multi-lib ICU, which (if a GUC is set) will enable the "search by
> collversion" behavior. Some doubts were raised about the wisdom of this
> approach, but it's the only multi-lib solution we have without doing
> some significant catalog work.
>
> I rendered the HTML docs for icu_multilib and attached to this email to
> make it easier to view.
>
> icu_multilib assumes that the various ICU library versions are already
> available in a single location, most likely installed with a package
> manager. That location can be the same as the built-in ICU, or a
> different location. Ideally, packagers would start to offer a few
> "stable" versions of ICU that would be available for a long time, but
> it will take a while for that to happen. So for now, it's up to the
> user to figure out how to get the right versions of ICU on their system
> and keep them there.
>
> Automated tests of icu_multilib are a problem unless the one running
> the tests is willing to compile the right versions of ICU (like I did).
> But I at least have automated tests for the hooks by using the test
> module test_collator_lib_hooks.
>
> The v7 patches in this thread are dependent on the pure refactoring
> patches in this CF entry:
>
>https://commitfest.postgresql.org/41/3935/
>
> https://postgr.es/m/052a5ed874d110be2f3ae28752e363306b10966d.ca...@j-davis.com
>
> The requested functionality _not_ offered by icu_multilib is tying a
> specific collation to a specific ICU version. A few variants were
> proposed, the latest is to tie a collation to the library file itself
> through the provider. That needs to be done with proper catalog support
> in core. But I believe the work I've done here has made a lot of
> progress in that direction, and also shows the versatility of the new
> hook to solve at least some problems.

This thread has been idle for a year now, It has stalled after a lot
of discussion.
@Jeff Davis: Do you want to try to restart the discussion by posting
an updated version and see what happens?

Regards,
Vignesh

Re: Collation version tracking for macOS

2023-01-10 Thread Peter Eisentraut


On 05.12.22 22:33, Thomas Munro wrote:

On Tue, Dec 6, 2022 at 6:45 AM Joe Conway  wrote:

On 12/5/22 12:41, Jeff Davis wrote:

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:

1.  I think we should seriously consider provider = ICU63.  I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work.  Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while
still
making it super clear what is going on.


As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

CREATE COLLATION PROVIDER icu63 TYPE icu
  AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.


+1

I like this. The provider kind defines which path we take in our code,
and the specific library unambiguously defines a specific collation
behavior (I think, ignoring bugs?)


OK, I'm going to see what happens if I try to wrangle that stuff into
a new catalogue table.


I'm reviewing the commit fest entry 
https://commitfest.postgresql.org/41/3956/, which points to this thread. 
 It appears that the above patch did not come about in time.  The patch 
of record is now Jeff's refactoring patch, which is also tracked in 
another commit fest entry (https://commitfest.postgresql.org/41/4058/). 
So as a matter of procedure, we should probably close this commit fest 
entry for now.  (Maybe we should also use a different thread subject in 
the future.)


I have a few quick comments on the above syntax example:

There is currently a bunch of locale-using code that selects different 
code paths by "collation provider", i.e., a libc-based code path and an 
ICU-based code path (and sometimes also a default provider path).  The 
above proposal would shift the terminology and would probably require 
some churn at those sites, in that they would now have to select by 
"collation provider type".  We could probably avoid that by shifting the 
terms a bit, so instead of the suggested


provider type -> provider

we could use

provider -> version of that provider

(or some other actual term), which would leave the meaning of "provider" 
unchanged as far as locale-using code is concerned.  At least that's my 
expectation, since no code for this has been seen yet.  We should keep 
this in mind in any case.


Also, the above example exposes a lot of operating system level details. 
 This creates issues with dump/restore, which some of the earlier 
patches avoided by using a path-based approach, and it would also 
require some thoughts about permissions.  We probably want 
non-superusers to be able to interact with this system somehow, for 
upgrading (for some meaning of that action) indexes etc. without 
superuser access.  The more stuff from the OS we expose, the more stuff 
we have to be able to lock down again in a usable manner.


(The search-by-collversion approach can probably avoid those issues better.)

Re: Collation version tracking for macOS

2022-12-05 Thread Thomas Munro

On Tue, Dec 6, 2022 at 6:45 AM Joe Conway  wrote:
> On 12/5/22 12:41, Jeff Davis wrote:
> > On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:
> >> 1.  I think we should seriously consider provider = ICU63.  I still
> >> think search-by-collversion is a little too magical, even though it
> >> clearly can be made to work.  Of the non-magical systems, I think
> >> encoding the choice of library into the provider name would avoid the
> >> need to add a second confusing "X_version" concept alongside our
> >> existing "X_version" columns in catalogues and DDL syntax, while
> >> still
> >> making it super clear what is going on.
> >
> > As I understand it, this is #2 in your previous list?
> >
> > Can we put the naming of the provider into the hands of the user, e.g.:
> >
> >CREATE COLLATION PROVIDER icu63 TYPE icu
> >  AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';
> >
> > In this model, icu would be a "provider kind" and icu63 would be the
> > specific provider, which is named by the user.
> >
> > That seems like the least magical approach, to me. We need an ICU
> > library; the administrator gives us one that looks like ICU; and we're
> > happy.
>
> +1
>
> I like this. The provider kind defines which path we take in our code,
> and the specific library unambiguously defines a specific collation
> behavior (I think, ignoring bugs?)

OK, I'm going to see what happens if I try to wrangle that stuff into
a new catalogue table.

Re: Collation version tracking for macOS

2022-12-05 Thread Joe Conway


On 12/5/22 12:41, Jeff Davis wrote:

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:

1.  I think we should seriously consider provider = ICU63.  I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work.  Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while
still
making it super clear what is going on.


As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

   CREATE COLLATION PROVIDER icu63 TYPE icu
 AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.


+1

I like this. The provider kind defines which path we take in our code, 
and the specific library unambiguously defines a specific collation 
behavior (I think, ignoring bugs?)


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Collation version tracking for macOS

2022-12-05 Thread Jeff Davis

On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote:
> 1.  I think we should seriously consider provider = ICU63.  I still
> think search-by-collversion is a little too magical, even though it
> clearly can be made to work.  Of the non-magical systems, I think
> encoding the choice of library into the provider name would avoid the
> need to add a second confusing "X_version" concept alongside our
> existing "X_version" columns in catalogues and DDL syntax, while
> still
> making it super clear what is going on.

As I understand it, this is #2 in your previous list?

Can we put the naming of the provider into the hands of the user, e.g.:

  CREATE COLLATION PROVIDER icu63 TYPE icu
AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63';

In this model, icu would be a "provider kind" and icu63 would be the
specific provider, which is named by the user.

That seems like the least magical approach, to me. We need an ICU
library; the administrator gives us one that looks like ICU; and we're
happy.

It avoids a lot of the annoyances we're discussing, and puts the power
in the hands of the admin. If they want to allow minor version updates,
they specify the library with .so.63, and let the symlinking handle it.

Of course, we can still do some sanity checks (WARNINGs or ERRORs) when
we think something is going wrong; like the version of ICU is too new,
or the reported version (ucol_getVersion()) doesn't match what's in
collversion. But we basically get out of the business of understanding
ICU versioning and leave that up to the administrator.

It's easier to document, and would require fewer GUCs (if any). And it
avoids mixing version information from another project into our data
model.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-12-05 Thread Robert Haas

On Sun, Dec 4, 2022 at 10:12 PM Thomas Munro  wrote:
> My tentative votes are:
>
> 1.  I think we should seriously consider provider = ICU63.  I still
> think search-by-collversion is a little too magical, even though it
> clearly can be made to work.  Of the non-magical systems, I think
> encoding the choice of library into the provider name would avoid the
> need to add a second confusing "X_version" concept alongside our
> existing "X_version" columns in catalogues and DDL syntax, while still
> making it super clear what is going on.  This would include adding DDL
> commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63
> to make warnings go way.

+1. I wouldn't lose any sleep if we picked a different non-magical
option, but I think this is probably my favorite of the
explicit-library-version options (though it is close) and I like it
better than search-by-collversion.

(It's possible that I'm wrong to like it better, but I do.)

> 2.  I think we should ignore minor versions for now (other than
> reporting them in the relevant introspection functions), but not make
> any choices that would prevent us from changing our mind about that in
> a later release.  For example, having two levels of specificity ICU
> and ICU68  in the libver-in-provider-name design wouldn't preclude us
> from adding support for ICU68_2 later

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2022-12-04 Thread Thomas Munro

On Tue, Nov 29, 2022 at 7:51 PM Jeff Davis  wrote:
> On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:
> > On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro 
> > wrote:
> > > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis 
> > > wrote:
> > > > I'd vote for 1 on the grounds that it's easier to document and
> > > > understand a single collation version, which comes straight from
> > > > ucol_getVersion(). This approach makes it a separate problem to
> > > > find
> > > > the collation version among whatever libraries the admin can
> > > > provide;
> > > > but adding some observability into the search should mitigate any
> > > > confusion.
> > >
> > > OK, it sounds like I should code that up next.
> >
> > Here's the first iteration.
>
> Thank you.

Thanks for the review.  Responses further down.  And thanks also for
the really interesting discussion about how the version numbers work
(or in some cases, don't work...), and practical packaging and linking
problems.

To have a hope of making something happen for PG16, which I think
means we need a serious contender patch in the next few weeks, we
really need to make some decisions.  I enjoyed trying out
search-by-collversion, but it's still not my favourite.  On the ballot
we have two main questions:

1.  Should we commit to search-by-collversion, or one of the explicit
library version ideas, and if the latter, which?
2.  Should we try to support being specific about minor versions (in
various different ways according to the choice made for #1)?

My tentative votes are:

1.  I think we should seriously consider provider = ICU63.  I still
think search-by-collversion is a little too magical, even though it
clearly can be made to work.  Of the non-magical systems, I think
encoding the choice of library into the provider name would avoid the
need to add a second confusing "X_version" concept alongside our
existing "X_version" columns in catalogues and DDL syntax, while still
making it super clear what is going on.  This would include adding DDL
commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63
to make warnings go way.

2.  I think we should ignore minor versions for now (other than
reporting them in the relevant introspection functions), but not make
any choices that would prevent us from changing our mind about that in
a later release.  For example, having two levels of specificity ICU
and ICU68  in the libver-in-provider-name design wouldn't preclude us
from adding support for ICU68_2 later

I haven't actually tried that design out in code yet, but I'm willing
to try to code that up very soon.  So no new patch from me yet.  Does
anyone else want to express a view?

> Proposed changes:
>
> * I attached a first pass of some documentation.

Thanks.  Looks pretty good, and much of it would stay if we changed to
one of the other models.

> * Should be another GUC to turn WARNING into an ERROR. Useful at least
> for testing; perhaps too dangerous for production.

OK, will add that into the next version.

> * The libraries should be loaded in a more diliberate order. The "*"
> should be expanded in a descending fashion so that later versions are
> preferred.

Yeah, I agree.

> * GUCs should be validated.

Will do.

> * Should validate that loaded library has expected version.

Will do.

> * We need to revise or remove pg_collation_actual_version() and
> pg_database_collation_actual_version().

I never liked that use of the word "actual"...

> * The GUCs are PGC_SUSET, but don't take effect because
> icu_library_list_fully_loaded is never reset.

True.  Just rought edges because I was trying to prototype
search-by-collversion fast.  Will consider this for the next version.

> * The extra collations you're adding at bootstrap time are named based
> on the library major version. I suppose it might be more "proper" to
> name them based on the collation version, but that would be more
> verbose, so I won't advocate for that. Just pointing it out.

Ah, yes, the ones with names like "en-US-x-icu68".  I agree that made
a little less sense in the search-by-collversion patch.  Maybe we
wouldn't want these at all in the search-by-collversion model.  But I
think they're perfect the way they are in the provider = ICU68 model.
The other idea I considered ages ago was that we could use namespaces:
you could "icu68.en-US", or just "en-US" in some contexts to get what
your search path sees, but that all seemed a little too cute and not
really like anything else we do with system-created catalogues, so I
gave that idea up.

> * It looks hard (or impossible) to mix multiple ICU libraries with the
> same major version and different minor versions. That's because,
> e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63,
> and when you install ICU 63.2, those dependencies get clobbered with
> the 63.2 versions. That fails the sanity check I proposed above about
> the library version number matching the requested library version
> number. And it also just seems

Re: Collation version tracking for macOS

2022-12-01 Thread Dagfinn Ilmari Mannsåker

Jeff Davis  writes:

> On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote:
>> On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:
>> > That is ... astonishingly bad.
>> 
>> https://unicode-org.atlassian.net/browse/CLDR-16175
>
> Oops, reported in CLDR instead of ICU. Moved to:
>
> https://unicode-org.atlassian.net/browse/ICU-22215

Out of morbid curiosity I went source diving, and the culprit is this
bit (which will also break if a version component ever goes above 999):

/* write the decimal field value */
field=versionArray[part];
if(field>=100) {
*versionString++=(char)('0'+field/100);
field%=100;
}
if(field>=10) {
*versionString++=(char)('0'+field/10);
field%=10;
}
*versionString++=(char)('0'+field);

(https://sources.debian.org/src/icu/72.1-3/source/common/putil.cpp#L2308)

because apparently snprintf() is too hard?

- ilmari

Re: Collation version tracking for macOS

2022-11-29 Thread Michael Paquier

On Wed, Nov 30, 2022 at 01:50:51PM +1300, Thomas Munro wrote:
> The new character added to Version 12.1 is:
> 
> U+32FF SQUARE ERA NAME REIWA
> 
> Version 12.1 adds that single character to enable software to be
> rapidly updated to support the new Japanese era name in calendrical
> systems and date formatting. The new Japanese era name was officially
> announced on April 1, 2019, and is effective as of May 1, 2019."
> 
> Wow!

Wow++.  I didn't know this one.

> Wikipedia says[2] "the "rei" character 令 has never appeared before".

At least there was some time ahead to prepare for the switch from "平
成" to "令和".  Things were much "funnier" when the era has switched
from "昭和" to "平成", as the sudden death of the emperor has required
Japan to switch to a new calendar very suddenly back in the day..
I've heard this was quite a mess for folks in IT back then, especially
for public agencies.
--
Michael


signature.asc
Description: PGP signature

Re: Collation version tracking for macOS

2022-11-29 Thread Thomas Munro

On Wed, Nov 30, 2022 at 1:25 PM Jeff Davis  wrote:
> On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote:
> > On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis  wrote:
> > > On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> > > https://unicode-org.atlassian.net/browse/ICU-22216
> >
> > I'm no expert on loader/linker arcana but I have a feeling this is a
> > dead end.  It's an ancient Unix or at least elf-era Unix convention
> > that SONAMEs have major versions only, because major versions are the
> > basis of ABI stability.
>
> It's possible that it's more a problem of how they are doing it: the
> specific version is coming from a dependency rather than the library
> itself. The results are surprising, so I figured it's worth a report.
> Let's see what they say.
>
> Regardless, even if they did make a change, it's not going to help us
> anytime soon. We can't rely on any scheme that involves multiple minor
> versions for a single major version being installed at once. That means
> that, if you create a collation depending on ICU X.Y, and then it gets
> upgraded to X.(Y+1), and you create another collation depending on that
> library version, you are stuck.

Mainstream package maintainers aren't going to let that happen anyway
as discussed, so this would always be a fairly specialised concern.
Maybe someone in our community would be motivated to publish a repo
full of mutant packages that don't conflict with each other and that
have specially modified DT_NEEDED, or are rolled into one single
library so the DT_NEEDED problem goes away.

Re: Collation version tracking for macOS

2022-11-29 Thread Thomas Munro

On Wed, Nov 30, 2022 at 1:32 PM Jeff Davis  wrote:
> On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:
> > On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis  wrote:
> > > Here's what I found for the 'ar' locale (firstminor/lastminor are
> > > the
> > > icu library versions, firstcollversion/lastcollversion are their
> > > respective collation versions for the given locale):
> > >
> > >  firstminor | lastminor | firstcollversion | lastcollversion
> > > +---+--+-
> > >  60.1   | 60.3  | 153.80.32| 153.80.32.1
> > >  64.1   | 64.2  | 153.96.35| 153.97.35.8
> > >  68.1   | 68.2  | 153.14.38| 153.14.38.8
> > > (3 rows)
> >
> > Right, this fits with what I said earlier: the third component is
> > CLDR
> > major, fourth component is CLDR minor except from ICU 61 on the CLDR
> > minor is << 3'd (X.X.38.8 means CLDR 38.1).
>
> What about 64.1 -> 64.2? That changed the *second* component from 96 ->
> 97. Are we agreed that collations can materially change in minor ICU
> releases?

That means that the Unicode/UCA version switched from 12 to 12.1, so
that's a confirmed sighting of a UCA minor version bump within one ICU
major version.  Let's see what the purpose of that Unicode minor
release was[1]:

"Unicode 12.1 adds exactly one character, for a total of 137,929 characters.

The new character added to Version 12.1 is:

U+32FF SQUARE ERA NAME REIWA

Version 12.1 adds that single character to enable software to be
rapidly updated to support the new Japanese era name in calendrical
systems and date formatting. The new Japanese era name was officially
announced on April 1, 2019, and is effective as of May 1, 2019."

Wow!

Wikipedia says[2] "the "rei" character 令 has never appeared before".

The sort order of characters that didn't previously exist is a special
topic.  In theory they can't hurt you because you shouldn't have been
using them, but PostgreSQL doesn't enforce that (other systems do), so
you could be exposed to a change from whatever default ordering the
non-existent codepoint had for random implementation reasons to some
deliberate ordering which may or may not be the same.

Are all Unicode/UCA minor versions of that type?  I dunno.  Something
to research, but [3] is far too vague and [4] is about other problems.

[1] https://unicode.org/versions/Unicode12.1.0/
[2] https://en.wikipedia.org/wiki/Reiwa
[3] https://www.unicode.org/versions/#major_minor
[4] https://www.unicode.org/policies/stability_policy.html

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote:
> On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis  wrote:
> > Here's what I found for the 'ar' locale (firstminor/lastminor are
> > the
> > icu library versions, firstcollversion/lastcollversion are their
> > respective collation versions for the given locale):
> > 
> >  firstminor | lastminor | firstcollversion | lastcollversion
> > +---+--+-
> >  60.1   | 60.3  | 153.80.32    | 153.80.32.1
> >  64.1   | 64.2  | 153.96.35    | 153.97.35.8
> >  68.1   | 68.2  | 153.14.38    | 153.14.38.8
> > (3 rows)
> 
> Right, this fits with what I said earlier: the third component is
> CLDR
> major, fourth component is CLDR minor except from ICU 61 on the CLDR
> minor is << 3'd (X.X.38.8 means CLDR 38.1).

What about 64.1 -> 64.2? That changed the *second* component from 96 ->
97. Are we agreed that collations can materially change in minor ICU
releases?


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote:
> On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis  wrote:
> > On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> > > One bit of weirdness is that I may have found another ICU
> > > problem.
> > 
> > Reported as:
> > 
> > https://unicode-org.atlassian.net/browse/ICU-22216
> 
> I'm no expert on loader/linker arcana but I have a feeling this is a
> dead end.  It's an ancient Unix or at least elf-era Unix convention
> that SONAMEs have major versions only, because major versions are the
> basis of ABI stability.

It's possible that it's more a problem of how they are doing it: the
specific version is coming from a dependency rather than the library
itself. The results are surprising, so I figured it's worth a report.
Let's see what they say.

Regardless, even if they did make a change, it's not going to help us
anytime soon. We can't rely on any scheme that involves multiple minor
versions for a single major version being installed at once. That means
that, if you create a collation depending on ICU X.Y, and then it gets
upgraded to X.(Y+1), and you create another collation depending on that
library version, you are stuck.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Thomas Munro

On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis  wrote:
> On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> > One bit of weirdness is that I may have found another ICU problem.
>
> Reported as:
>
> https://unicode-org.atlassian.net/browse/ICU-22216

I'm no expert on loader/linker arcana but I have a feeling this is a
dead end.  It's an ancient Unix or at least elf-era Unix convention
that SONAMEs have major versions only, because major versions are the
basis of ABI stability.

As a workaround with an already built ICU, I think you could use elf
editing tools like "patchelf" to change the SONAME and DT_NEEDED to
include the minor version.  Or you could convince the build/link
scripts to set them that way in the first place, but no distro would
want to do that as it would cause lots of executables to fail to load
when the next ICU minor comes out.

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Wed, 2022-11-30 at 09:00 +1300, Thomas Munro wrote:
> I'm struggling to understand what's new about proposal #6.

Perhaps it's just a slight variant; I'm not sure. It's not a complete
proposal yet.

The difference I had in mind is that it would treat the built-in ICU
differently from what is found in icu_library_path. I think that could
remove confusion over what happens when you upgrade the system's ICU
library.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Thomas Munro

On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis  wrote:
> Here's what I found for the 'ar' locale (firstminor/lastminor are the
> icu library versions, firstcollversion/lastcollversion are their
> respective collation versions for the given locale):
>
>  firstminor | lastminor | firstcollversion | lastcollversion
> +---+--+-
>  60.1   | 60.3  | 153.80.32| 153.80.32.1
>  64.1   | 64.2  | 153.96.35| 153.97.35.8
>  68.1   | 68.2  | 153.14.38| 153.14.38.8
> (3 rows)

Right, this fits with what I said earlier: the third component is CLDR
major, fourth component is CLDR minor except from ICU 61 on the CLDR
minor is << 3'd (X.X.38.8 means CLDR 38.1).  I wrote something about
that particular CLDR upgrade that happened in ICU 68 back here, with a
link to the CLDR change list:

https://www.postgresql.org/message-id/ca+hukgjxg6abkc9rj7r1byvltvvkthqv+rzo6bkvwyespcp...@mail.gmail.com

TL;DR that particular CLDR change didn't actually affect collations,
it affected other locale stuff we don't care about (timezones etc).
We probably have to assume that any CLDR change *might* affect us,
though, unless we can find a written policy somewhere that says CLDR
minor changes never change sort order.  But I wouldn't want to get
into 2nd guessing their ucol_getVersion() format, and if they knew
that minor changes didn't affect sort order they presumably wouldn't
have included it in the recipe, so I think we simply have to treat it
as opaque and assume that ucol_getVersion() change means what it says
on the tin: sort order might have changed.

> I suppose the next step is to test with actual data and find
> differences?

Easier to read the published CLDR deltas, but I'm not sure it'd tell
us much about what *could* happen in future releases...

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Wed, 2022-11-30 at 08:41 +1300, Thomas Munro wrote:
> In terms of user experience, I think that might mean that users of
> 'zh' who encounter warnings after a minor upgrade would therefore
> really only have the options of REFRESHing and rebuilding, or
> downgrading the package, because there's no way for us to access the
> older version.  Users of 'en' probably only encounter collversion
> changes when moving between OS releases with an ICU major version
> change, and then the various schemes in this thread can help them
> avoid the need to rebuild, until they eventually want to, if ever.

I installed the first minor release for each major, and got some new
tables. I think we can all agree that it's a lot easier to work with
information once it's in table form.

Here's what I found for the 'ar' locale (firstminor/lastminor are the
icu library versions, firstcollversion/lastcollversion are their
respective collation versions for the given locale):

 firstminor | lastminor | firstcollversion | lastcollversion 
+---+--+-
 60.1   | 60.3  | 153.80.32| 153.80.32.1
 64.1   | 64.2  | 153.96.35| 153.97.35.8
 68.1   | 68.2  | 153.14.38| 153.14.38.8
(3 rows)

For 'en':

 firstminor | lastminor | firstcollversion | lastcollversion 
+---+--+-
 64.1   | 64.2  | 153.96   | 153.97
(1 row)

And for 'zh':

 firstminor | lastminor | firstcollversion | lastcollversion 
+---+--+-
 60.1   | 60.3  | 153.80.32| 153.80.32.1
 64.1   | 64.2  | 153.96.35| 153.97.35.8
 68.1   | 68.2  | 153.14.38| 153.14.38.8
(3 rows)

It looks like collation versions do change in minor releases. It looks
like it's *not* safe to lock a collation to a major version *if* that
major version could be updated to a new minor. And we can't lock to a
minor, as I said earlier. Therefore, once we lock a collation down to a
major release, we better keep that in the icu_library_path, and never
touch it, and never install a new minor for that major.

Then again, maybe some of these are just about how the version is
reported... maybe 153.80.32 and 153.80.32.1 are really the same
version? But 64.1 -> 64.2 looks like a real difference.

I suppose the next step is to test with actual data and find
differences?

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Tue, 2022-11-29 at 14:34 -0500, Joe Conway wrote:
> I understand that it is not easily done, but if the combination of 
> collprovider + collversion does not represent specific immutable 
> ordering behavior for a given locale

Given the u_versionToString() bug, we know the version string could end
up being the same between two different collation versions (e.g.
153.104 and 153.14). So that really undermines the credibility of ICU's
collation versions (at least the strings, which is what we store in
collversion).

But if we ignore that bug, do we have evidence that the actual versions
could be the same for collations that sort differently? It's worth
exploring, to be sure, but right now I don't know of a case.

> , what value is there in tracking [collation version]?

Similarly, what is the value in tracking the library minor versions, if
when you open libicui18n.63.1, you may end up with a mix of code
between 63.1 and 63.2?

That doesn't mean it's impossible. We could attach collations to a
library major version, and tell administrators that once they install a
major version in icu_library_path, they never touch that major version
again (no updates or new minors, only new majors). #6 might be a good
approach to facilitate this best practice. We'd then probably need to
change collversion to be a library major version, and then come up with
a migration path from 15 -> 16. Or we could store both library major
version and collversion, and verify both.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Thomas Munro

On Wed, Nov 30, 2022 at 8:52 AM Robert Haas  wrote:
> On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis  wrote:
> > 6. Create a new concept of a "locked down collation" that points at
> > some specific collation code (identified by some combination of library
> > version and collation version or whatever else can be used to identify
> > it). If a collation is locked down, it would never have a fallback or
> > any other magic, it would either find the code it's looking for, or
> > fail. If a collation is not locked down, it would look only in the
> > built-in ICU library, and warn if it detects some kind of change
> > (again, by whatever heuristic we think is reasonable).
>
> It seems like it would be somewhat reasonable to allow varying levels
> of specificity in saying which what suffix to append when calling
> dlopen() on the ICU library. Like you could allow adding nothing,
> which would find the system-default ICU, or you could add 53 to find
> the default version of ICU 53, or you could 53.1 to pick a specific
> minor version. The idea is that the symlinks in the filesystem would
> be responsible for sorting out the meaning of the supplied string. The
> way that minor versions work may preclude having this work as well as
> one might hope, though.

I'm struggling to understand what's new about proposal #6.  The
earlier proposals except #1 already contemplated different levels of
locked-down-ness.  For example in the libversion-as-provider idea, we
said you could use just provider = ICU (warn me if the collverison
changes, but always use the "default" library and carry on, pretty
much like today except perhaps "the default" can be changed with a
GUC), or you could be more specific and say provider = ICU63.  (We
also mentioned ICU63_2 as a third level of specificity, but maybe
that's practically impossible.)  And it was the same for the other
ideas, just encoded in different ways.

Re: Collation version tracking for macOS

2022-11-29 Thread Robert Haas

On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis  wrote:
> 6. Create a new concept of a "locked down collation" that points at
> some specific collation code (identified by some combination of library
> version and collation version or whatever else can be used to identify
> it). If a collation is locked down, it would never have a fallback or
> any other magic, it would either find the code it's looking for, or
> fail. If a collation is not locked down, it would look only in the
> built-in ICU library, and warn if it detects some kind of change
> (again, by whatever heuristic we think is reasonable).

It seems like it would be somewhat reasonable to allow varying levels
of specificity in saying which what suffix to append when calling
dlopen() on the ICU library. Like you could allow adding nothing,
which would find the system-default ICU, or you could add 53 to find
the default version of ICU 53, or you could 53.1 to pick a specific
minor version. The idea is that the symlinks in the filesystem would
be responsible for sorting out the meaning of the supplied string. The
way that minor versions work may preclude having this work as well as
one might hope, though.

I continue to be confused about why collation maintainers think that
it's OK to whack stuff around in minor versions. The thought that
people might use collations to sort data that needs to stay sorted
after upgrading the library seems to be an alien one, and it doesn't
really seem like libicu is a whole lot better than libc, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2022-11-29 Thread Thomas Munro

On Wed, Nov 30, 2022 at 8:03 AM Jeff Davis  wrote:
> On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote:
> > I think it also includes the CLDR version for *some* locales.  From a
> > quick look, that includes 'ar', 'ru', 'tr', 'zh'.  Jeff, would you
> > mind sharing the same table for one of those?  Perhaps 'en' really
> > does depend only on the UCA?
>
> =# select * from pg_icu_collation_versions('ar') order by icu_version;
>  icu_version | uca_version | collator_version
> -+-+--
>  50.2| 6.2 | 58.0.0.50
>  51.3| 6.2 | 58.0.0.50
>  52.2| 6.2 | 58.0.0.50
>  53.2| 6.3 | 137.51.25
>  54.2| 7.0 | 137.56.26
>  55.2| 7.0 | 153.56.27.1
>  56.2| 8.0 | 153.64.28
>  57.2| 8.0 | 153.64.29
>  58.3| 9.0 | 153.72.30.3
>  59.2| 9.0 | 153.72.31.1
>  60.3| 10.0| 153.80.32.1
>  61.2| 10.0| 153.80.33
>  62.2| 11.0| 153.88.33.8
>  63.2| 11.0| 153.88.34
>  64.2| 12.1| 153.97.35.8
>  65.1| 12.1| 153.97.36
>  66.1| 13.0| 153.14.36.8
>  67.1| 13.0| 153.14.37
>  68.2| 13.0| 153.14.38.8
>  69.1| 13.0| 153.14.39
>  70.1| 14.0| 153.112.40
> (21 rows)

Thanks.  So now we can see that the CLDR minor version is there too.
At a guess, in ICU 60 and before, it was the 4th component directly,
and from ICU 61 on, it's shifted left 3 bits.  I guess that means
those CLDR-dependent locales have higher frequency collversion
changes, including everyday "apt-get upgrade" (no major OS upgrade
required), assuming that Debian et al take those minor upgrades, while
others like 'en' should be stable for the whole ICU major version's
lifetime, and even across some ICU major version upgrades, because the
Unicode/UCA version changes more slowly.

Those CLDR-dependent locales therefore present us with a problem: as
discussed a while back, it's impossible to install two minor versions
of the same ICU major version with packages, and as Jeff has pointed
out in recent emails, even if you compile them yourself (which no one
really expects users to do), it doesn't really work because the
SONAMEs only have the major version, so the various libraries
that make up ICU will not be able to open each other correctly
(they'll follow symlinks to an arbitrary minor version).  (These two
things are not unrelated.)  So I probably need to remove the code that
claimed to support minor version addressing and go back to the
previous thinking that major will have to be enough.

In terms of user experience, I think that might mean that users of
'zh' who encounter warnings after a minor upgrade would therefore
really only have the options of REFRESHing and rebuilding, or
downgrading the package, because there's no way for us to access the
older version.  Users of 'en' probably only encounter collversion
changes when moving between OS releases with an ICU major version
change, and then the various schemes in this thread can help them
avoid the need to rebuild, until they eventually want to, if ever.

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote:
> One bit of weirdness is that I may have found another ICU problem.

Reported as:

https://unicode-org.atlassian.net/browse/ICU-22216


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote:
> On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:
> > That is ... astonishingly bad.
> 
> https://unicode-org.atlassian.net/browse/CLDR-16175

Oops, reported in CLDR instead of ICU. Moved to:

https://unicode-org.atlassian.net/browse/ICU-22215


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Joe Conway


On 11/29/22 13:59, Jeff Davis wrote:

On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote:
My vote is for something like #5. The collversion should indicate a 
specific immutable ordering behavior.


Easier said than done:
https://www.postgresql.org/message-id/abddc35a7a447d93e2b8371a1a9052cb48866070.ca...@j-davis.com

Even pointing at a specific minor version doesn't guarantee that
specific ICU code is loaded. It could also be a mix of different minor
versions that happen to be installed.


I understand that it is not easily done, but if the combination of 
collprovider + collversion does not represent specific immutable 
ordering behavior for a given locale, what value is there in tracking it?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote:
> On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider
>  wrote:
> > It seems to me that the collator_version field is not a good
> > version
> > identifier to use.
> > 
> > Just taking a quick glance at the ICU home page right now, it shows
> > that
> > all of the last 5 versions of ICU have included "additions and
> > corrections" to locale data itself, including 68 to 69 where the
> > collator version did not change.
> > 
> > Is it possible that this "collator_version" only reflects the code
> > that
> > processes collation data to do comparisons/sorts, but it does not
> > reflect updates to the locale data itself?
> 
> I think it also includes the CLDR version for *some* locales.  From a
> quick look, that includes 'ar', 'ru', 'tr', 'zh'.  Jeff, would you
> mind sharing the same table for one of those?  Perhaps 'en' really
> does depend only on the UCA?

=# select * from pg_icu_collation_versions('ar') order by icu_version;
 icu_version | uca_version | collator_version 
-+-+--
 50.2| 6.2 | 58.0.0.50
 51.3| 6.2 | 58.0.0.50
 52.2| 6.2 | 58.0.0.50
 53.2| 6.3 | 137.51.25
 54.2| 7.0 | 137.56.26
 55.2| 7.0 | 153.56.27.1
 56.2| 8.0 | 153.64.28
 57.2| 8.0 | 153.64.29
 58.3| 9.0 | 153.72.30.3
 59.2| 9.0 | 153.72.31.1
 60.3| 10.0| 153.80.32.1
 61.2| 10.0| 153.80.33
 62.2| 11.0| 153.88.33.8
 63.2| 11.0| 153.88.34
 64.2| 12.1| 153.97.35.8
 65.1| 12.1| 153.97.36
 66.1| 13.0| 153.14.36.8
 67.1| 13.0| 153.14.37
 68.2| 13.0| 153.14.38.8
 69.1| 13.0| 153.14.39
 70.1| 14.0| 153.112.40
(21 rows)


=# select * from pg_icu_collation_versions('zh') order by icu_version;
 icu_version | uca_version | collator_version 
-+-+--
 50.2| 6.2 | 58.0.0.50
 51.3| 6.2 | 58.0.0.50
 52.2| 6.2 | 58.0.0.50
 53.2| 6.3 | 137.51.25
 54.2| 7.0 | 137.56.26
 55.2| 7.0 | 153.56.27.1
 56.2| 8.0 | 153.64.28
 57.2| 8.0 | 153.64.29
 58.3| 9.0 | 153.72.30.3
 59.2| 9.0 | 153.72.31.1
 60.3| 10.0| 153.80.32.1
 61.2| 10.0| 153.80.33
 62.2| 11.0| 153.88.33.8
 63.2| 11.0| 153.88.34
 64.2| 12.1| 153.97.35.8
 65.1| 12.1| 153.97.36
 66.1| 13.0| 153.14.36.8
 67.1| 13.0| 153.14.37
 68.2| 13.0| 153.14.38.8
 69.1| 13.0| 153.14.39
 70.1| 14.0| 153.112.40
(21 rows)


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote:
> My vote is for something like #5. The collversion should indicate a 
> specific immutable ordering behavior.

Easier said than done:

https://www.postgresql.org/message-id/abddc35a7a447d93e2b8371a1a9052cb48866070.ca...@j-davis.com

Even pointing at a specific minor version doesn't guarantee that
specific ICU code is loaded. It could also be a mix of different minor
versions that happen to be installed.

But if we ignore that problem for a moment, and assume that major
version is precise enough, let me make another proposal (not advocating
for this, but wanted to put it out there):

6. Create a new concept of a "locked down collation" that points at
some specific collation code (identified by some combination of library
version and collation version or whatever else can be used to identify
it). If a collation is locked down, it would never have a fallback or
any other magic, it would either find the code it's looking for, or
fail. If a collation is not locked down, it would look only in the
built-in ICU library, and warn if it detects some kind of change
(again, by whatever heuristic we think is reasonable).

#6 doesn't answer all of the problems I pointed out earlier:

https://www.postgresql.org/message-id/83faecb4a89dfb5794938e7b4d9f89daf4c5d631.ca...@j-davis.com

but could be a better starting place for answers.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Jeff Davis

On Tue, 2022-11-29 at 12:32 -0500, Robert Haas wrote:
> You know more about this than I do, for sure, so don't let my vote
> back the project into a bad spot.

I'm going back and forth myself. I haven't found a great answer here
yet.

>  But, yeah, the thing you mention
> here is what I'm worried about. Without a way to force a certain
> behavior for a certain particular collation, you don't have an escape
> valve if the global library ordering isn't doing what you want.

One bit of weirdness is that I may have found another ICU problem.
First, install 63.1, and you get (editing for clarity):

$ ls -l /path/to/libicui18n.so.63*
/path/to/libicui18n.so.63 -> libicui18n.so.63.1
/path/to/libicui18n.so.63.1

$ ls -l /path/to/libicuuc.so.63*
/path/to/libicuuc.so.63 -> libicuuc.so.63.1
/path/to/libicuuc.so.63.1

$ ls -l /path/to/libicudata.so.63*
/path/to/libicudata.so.63 -> libicudata.so.63.1
/path/to/lib/libicudata.so.63.1

$ ldd /path/to/libicui18n.so.63.1
libicuuc.so.63 => /path/to/libicuuc.so.63
libicudata.so.63 => /path/to/libicudata.so.63 

OK, now install 63.2. Then you get:

$ ls -l /path/to/libicui18n.so.63*
/path/to/libicui18n.so.63 -> libicui18n.so.63.2
/path/to/libicui18n.so.63.1
/path/to/libicui18n.so.63.2

$ ls -l /path/to/libicuuc.so.63*
/path/to/libicuuc.so.63 -> libicuuc.so.63.2
/path/to/libicuuc.so.63.1
/path/to/libicuuc.so.63.2

$ ls -l /path/to/libicudata.so.63*
/path/to/libicudata.so.63 -> libicudata.so.63.2
/path/to/libicudata.so.63.1
/path/to/libicudata.so.63.2

$ ldd /path/to/libicui18n.so.63.2
libicuuc.so.63 => /path/to/libicuuc.so.63
libicudata.so.63 => /path/to/libicudata.so.63

The problem is that the specific minor version 63.1 depends on only the
major version of its ICU link dependencies. When loading
libicui18n.so.63.1, you are actually pulling in libicuuc.so.63.2 and
libicudata.so.63.2.

When I tried this with Thomas's patch, it caused some confusing
problems. I inserted a check that, when you open a library, that the
requested and reported versions match, and the check failed when
multiple minors are installed. In other words, opening
libicui18n.so.63.1 reports a version of 63.2!

(Note: I compiled ICU with --enable-rpath, but I don't think it
matters.)

Summary: even locking down to a minor version does not seem to identify
a specific ICU library, because its shared library dependencies do not
reference a specific minor version.

> It's entirely possible that the scenario I'm worried about is too
> remote in practice to be concerned about. I don't know how this stuff
> works well enough to be certain. It's just that, on the basis of
> previous experience, (1) it's not that uncommon for people to
> actually
> end up in situations that we thought shouldn't ever happen and (2)
> code that deals with collations is more untrustworthy than average.

Yeah...

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-29 Thread Thomas Munro

On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider
 wrote:
> It seems to me that the collator_version field is not a good version
> identifier to use.
>
> Just taking a quick glance at the ICU home page right now, it shows that
> all of the last 5 versions of ICU have included "additions and
> corrections" to locale data itself, including 68 to 69 where the
> collator version did not change.
>
> Is it possible that this "collator_version" only reflects the code that
> processes collation data to do comparisons/sorts, but it does not
> reflect updates to the locale data itself?

I think it also includes the CLDR version for *some* locales.  From a
quick look, that includes 'ar', 'ru', 'tr', 'zh'.  Jeff, would you
mind sharing the same table for one of those?  Perhaps 'en' really
does depend only on the UCA?

Re: Collation version tracking for macOS

2022-11-29 Thread Jeremy Schneider

On 11/28/22 6:54 PM, Jeff Davis wrote:

> 
> =# select * from pg_icu_collation_versions('en_US') order by
> icu_version;
>  icu_version | uca_version | collator_version 
> -+-+--
>  ...
>  67.1| 13.0| 153.14
>  68.2| 13.0| 153.14
>  69.1| 13.0| 153.14
>  70.1| 14.0| 153.112
> (21 rows)
> 
> This is good information, because it tells us that major library
> versions change more often than collation versions, empirically-
> speaking.

It seems to me that the collator_version field is not a good version
identifier to use.

Just taking a quick glance at the ICU home page right now, it shows that
all of the last 5 versions of ICU have included "additions and
corrections" to locale data itself, including 68 to 69 where the
collator version did not change.

Is it possible that this "collator_version" only reflects the code that
processes collation data to do comparisons/sorts, but it does not
reflect updates to the locale data itself?

https://icu.unicode.org/

ICU v72 -> CLDR v42
ICU v71 -> CLDR v41
ICU v70 -> CLDR v40
ICU v69 -> CLDR v39
ICU v68 -> CLDR v38

-Jeremy

-- 
http://about.me/jeremy_schneider

Re: Collation version tracking for macOS

2022-11-29 Thread Robert Haas

On Mon, Nov 28, 2022 at 11:49 PM Jeff Davis  wrote:
> Not necessarily, #2-4 (at least as implemented in v7) can only load one
> major version at a time, so can't specify minor versions:
> https://www.postgresql.org/message-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.ca...@j-davis.com
>
> With #1, you can provide control over the search order to find the
> symbol you want. Granted, if you want to specify that different
> collations look in different libraries for the same version, then it
> won't work, because the search order is global -- is that what you're
> worried about? If so, I think we need to compare it against the
> downsides of #2-4, which in my opinion are more serious.

You know more about this than I do, for sure, so don't let my vote
back the project into a bad spot. But, yeah, the thing you mention
here is what I'm worried about. Without a way to force a certain
behavior for a certain particular collation, you don't have an escape
valve if the global library ordering isn't doing what you want. Your
argument seems to at least partly be that #1 will be more usable on
the whole, and that does seem like an important consideration. People
may have a lot of collations and adjusting them all individually could
be difficult and unpleasant. However, I think it's also worth asking
what options someone has if #1 can't be made to work due to a single
ordering controlling every collation.

It's entirely possible that the scenario I'm worried about is too
remote in practice to be concerned about. I don't know how this stuff
works well enough to be certain. It's just that, on the basis of
previous experience, (1) it's not that uncommon for people to actually
end up in situations that we thought shouldn't ever happen and (2)
code that deals with collations is more untrustworthy than average.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2022-11-29 Thread Joe Conway


On 11/28/22 14:11, Robert Haas wrote:

On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro  wrote:

OK.  Time for a new list of the various models we've discussed so far:

1.  search-by-collversion:  We introduce no new "library version"
concept to COLLATION and DATABASE object and little or no new syntax.

2.  lib-version-in-providers: We introduce a separate provider value
for each ICU version, for example ICU63, plus an unversioned ICU like
today.

3.  lib-version-in-attributes: We introduce daticuversion (alongside
datcollversion) and collicuversion (alongside collversion).  Similar
to the above, but it's a separate property and the provider is always
ICU.  New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
ICU_VERSION.

4.  lib-version-in-locale:  "63:en" from earlier versions.  That was
mostly a strawman proposal to avoid getting bogged down in
syntax/catalogue/model change discussions while trying to prove that
dlopen would even work.  It doesn't sound like anyone really likes
this.

5.  lib-version-in-collversion:  We didn't explicitly discuss this
before, but you hinted at it: we could just use u_getVersion() in
[dat]collversion.


I'd like to vote against #3 at least in the form that's described
here. If we had three more libraries providing collations, it's likely
that they would need versioning, too. So if we add an explicit notion
of provider version, then it ought not to be specific to libicu.


+many


I think it's OK to decide that different library versions are
different providers (your option #2), or that they are the same
provider but give rise to different collations (your option #4), or
that there can be multiple version of each collation which are
distinguished by some additional provider version field (your #3 made
more generic).


I think provider and collation version are distinct concepts. The 
provider ('c' versus 'i' for example) determines a unique code path in 
the backend due to different APIs, whereas collation version is related 
to a specific ordering given a set of characters.




I don't really understand #1 or #5 well enough to have an educated
opinion, but I do think that #1 seems a bit magical. It hopes that the
combination of a collation name and a datcollversion will be
sufficient to find exactly one matcing collation in a list of provided
libraries. The advantage of that, as I understand it, is that if you
do something to your system that causes the number of matches to go
from one to zero, you can just throw another library on the pile and
get the number back up to one. Woohoo! But there's a part of me that
worries: what if the number goes up to two, and they're not all the
same? Probably that's something that shouldn't happen, but if it does
then I think there's kind of no way to fix it. With the other options,
if there's some way to jigger the catalog state to match what you want
to happen, you can always repair the situation somehow, because the
library to be used for each collation is explicitly specified in some
way, and you just have to get it to match what you want to have
happen.


My vote is for something like #5. The collversion should indicate a 
specific immutable ordering behavior.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Collation version tracking for macOS

2022-11-28 Thread Jeff Davis

On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:
> On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro 
> wrote:
> > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis 
> > wrote:
> > > I'd vote for 1 on the grounds that it's easier to document and
> > > understand a single collation version, which comes straight from
> > > ucol_getVersion(). This approach makes it a separate problem to
> > > find
> > > the collation version among whatever libraries the admin can
> > > provide;
> > > but adding some observability into the search should mitigate any
> > > confusion.
> > 
> > OK, it sounds like I should code that up next.
> 
> Here's the first iteration.

Thank you.

Proposed changes:

* I attached a first pass of some documentation.

* Should be another GUC to turn WARNING into an ERROR. Useful at least
for testing; perhaps too dangerous for production.

* The libraries should be loaded in a more diliberate order. The "*"
should be expanded in a descending fashion so that later versions are
preferred.

* GUCs should be validated.

* Should validate that loaded library has expected version.

* We need to revise or remove pg_collation_actual_version() and
pg_database_collation_actual_version().

* The GUCs are PGC_SUSET, but don't take effect because
icu_library_list_fully_loaded is never reset.

* The extra collations you're adding at bootstrap time are named based
on the library major version. I suppose it might be more "proper" to
name them based on the collation version, but that would be more
verbose, so I won't advocate for that. Just pointing it out.

* It looks hard (or impossible) to mix multiple ICU libraries with the
same major version and different minor versions. That's because,
e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63,
and when you install ICU 63.2, those dependencies get clobbered with
the 63.2 versions. That fails the sanity check I proposed above about
the library version number matching the requested library version
number. And it also just seems wrong -- why would you have minor-
version precision about an ICU library but then only major-version
precision about the ICU dependencies of that library? Doesn't that
defeat the whole purpose of this naming scheme? (Maybe another ICU
bug?).

Minor comments:

* ICU_I18N is defined in make_icu_library_name() but used outside of
it. One solution might be to have it return both library names to the
caller and rename it as make_icu_library_names().

* get_icu_function() could use a clarifying comment or a better name.
Something that communicates that you are looking for the function in
the given library with the given major version number (which may or may
not be needed depending on how the library was compiled).

* typo in comment over make_icu_collator:
s/u_getVersion/ucol_getVersion/

* The return value of make_icu_collator() seems backwards to me,
stylistically. I typically see the false-is-good pattern with integer
returns.

* weird bracketing style in get_icu_collator for the "else"

>   The version rosetta stone functions look like this:
> 
> postgres=# select * from pg_icu_library_versions();
>  icu_version | unicode_version | cldr_version
> -+-+--
>  67.1    | 13.0    | 37.0
>  63.1    | 11.0    | 34.0
>  57.1    | 8.0 | 29.0
> (3 rows)
> 
> postgres=# select * from pg_icu_collation_versions('zh');
>  icu_version | uca_version | collator_version
> -+-+--
>  67.1    | 13.0    | 153.14.37
>  63.1    | 11.0    | 153.88.34
>  57.1    | 8.0 | 153.64.29
> (3 rows)

I like these functions.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 445fd175d8..b9dba8ac67 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1047,6 +1047,50 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr

+  
+   Collation Versions
+
+   
+Collations are sensitive to the specific collation version, which is
+obtained from the collation provider library at the time the collation is
+created (and only updated with ). If
+the collation provider library is updated on the system (e.g. due to an
+operating system upgrade), it may provide a different collation version;
+but the version recorded in PostgreSQL will
+remain unchanged.
+   
+   
+New collation versions are generally desirable, as they reflect changes in
+natural language over time. But these ordering changes can also cause
+problems, such as the inconsistency of an indexes, which often depend on a
+stable ordering. If PostgreSQL is unable to
+find a collation in the collation provider that matches the recorded
+version exactly, it will emit a WARNING (configurable
+with ).
+   
+   
+Multiple ICU collation provider libraries
+
+ When using the

Re: Collation version tracking for macOS

2022-11-28 Thread Jeff Davis

On Mon, 2022-11-28 at 14:11 -0500, Robert Haas wrote:
> I don't really understand #1 or #5 well enough to have an educated
> opinion, but I do think that #1 seems a bit magical. It hopes that
> the
> combination of a collation name and a datcollversion will be
> sufficient to find exactly one matcing collation in a list of
> provided
> libraries. The advantage of that, as I understand it, is that if you
> do something to your system that causes the number of matches to go
> from one to zero, you can just throw another library on the pile and
> get the number back up to one. Woohoo! But there's a part of me that
> worries: what if the number goes up to two, and they're not all the
> same? Probably that's something that shouldn't happen, but if it does
> then I think there's kind of no way to fix it. With the other
> options,
> if there's some way to jigger the catalog state to match what you
> want
> to happen, you can always repair the situation somehow, because the
> library to be used for each collation is explicitly specified in some
> way, and you just have to get it to match what you want to have
> happen.

Not necessarily, #2-4 (at least as implemented in v7) can only load one
major version at a time, so can't specify minor versions:
https://www.postgresql.org/message-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.ca...@j-davis.com

With #1, you can provide control over the search order to find the
symbol you want. Granted, if you want to specify that different
collations look in different libraries for the same version, then it
won't work, because the search order is global -- is that what you're
worried about? If so, I think we need to compare it against the
downsides of #2-4, which in my opinion are more serious.

The first thing to sort out with options #2-4 is: what about minor
versions? V7 took the approach that only the major version matters.
That means that if you want to select a specific minor version, then
you are out of luck, because only one major at a time can be loaded,
globally. But paying attention to minor versions seems like a mess --
we'd need even more magical fallbacks that try later minor versions or
something.

Second, there is weirdness in the common case that a collation version
doesn't change between versions. Let's say you have a collation
"mycoll" with locale "en_US" and it's pointed at built-in library
version 64, with collation version 153.97. GUC
default_icu_library_version is set to 63. Then you upgrade the system
and ICU gets updated from 64 -> 65. Now, it can't find version 64 to
load, so it falls back to 63 (which has the wrong version 153.88), even
though 65 is just fine because it still offers that locale with version
153.97. (A similar problem exists when you remove a version of ICU from
icu_library_path, and another version suffices for all of your
collations.)

Thirdly, as I said earlier, it's just hard on the user to try to sort
out two different versions modeled in the database. Understanding
encodings and collations are hard enough, and then we introduce *two*
versions on top of that.

Fourth, I don't see what the point of ucol_getVersion() is in schemes
#2-4. All it does is control a WARNING, because throwing an error (at
least by default) would be too harsh, given that users have lived with
these risks for so long. But if all it does is throw a warning, what's
the point in modeling it in the catalog as though it's the most
important version?

Ultimately, I think collation version (as reported by
ucol_getVersion()) is the most accurate and least-surprising way to
match a library-provided collation with the collation in the catalog.
And it seems like we'd be using it in exactly the way the ICU
maintainers intend it to be used.

Of course, I cast my vote for #1 before I discovered this ICU bug
here: 
https://www.postgresql.org/message-id/0f7922d4f411376f420ec9139febeae4cdc748a6.ca...@j-davis.com

That injects some doubt, to be sure. If I were to try to solve the
problems with #2-4, one approach might be to treat the built-in ICU
version differently from the ones in icu_library_path. Not quite sure,
I'd have to think more. But as of now, I'd still lean toward #1 until a
better option is presented.

Regards,
Jeff Davis

Re: Collation version tracking for macOS

2022-11-28 Thread Thomas Munro

On Tue, Nov 29, 2022 at 3:55 PM Jeff Davis  wrote:
> =# select * from pg_icu_collation_versions('en_US') order by
> icu_version;
>  icu_version | uca_version | collator_version
> -+-+--
>  50.2| 6.2 | 58.0.6.50
>  51.3| 6.2 | 58.0.6.50
>  52.2| 6.2 | 58.0.6.50
>  53.2| 6.3 | 137.51
>  54.2| 7.0 | 137.56
>  55.2| 7.0 | 153.56
>  56.2| 8.0 | 153.64
>  57.2| 8.0 | 153.64
>  58.3| 9.0 | 153.72
>  59.2| 9.0 | 153.72
>  60.3| 10.0| 153.80
>  61.2| 10.0| 153.80
>  62.2| 11.0| 153.88
>  63.2| 11.0| 153.88
>  64.2| 12.1| 153.97
>  65.1| 12.1| 153.97
>  66.1| 13.0| 153.14
>  67.1| 13.0| 153.14
>  68.2| 13.0| 153.14
>  69.1| 13.0| 153.14
>  70.1| 14.0| 153.112
> (21 rows)
>
> This is good information, because it tells us that major library
> versions change more often than collation versions, empirically-
> speaking.

Wow, nice discovery about 104 -> 14.  Yeah, I imagine we'll want some
kind of band-aid to tolerate that exact screwup and avoid spurious
warnings.

Bugs aside, that's quite a revealing table in other ways.  We can see:

* The version scheme changed completely in ICU 53.  This corresponds
to a major rewrite of the collation code, I see[1].

* The first component seems to be (UCOL_RUNTIME_VERSION << 4) + 9.
UCOL_RUNTIME_VERSION is in their uvernum.h, currently 9, was 8, bumped
between 54 and 55 (I see this in their commit log), corresponding to
the two possible numbers 137 and 153 that we see there.  I don't know
where the final 9 term is coming from but it looks stable since the v2
collation rewrite landed.

* The second component seems to be uca_version_major * 8 +
uca_version_minor (that's the Unicode Collation Algorithm version, and
so far always matches the Unicode version, visible in the output of
the other function).

* The values you showed for English don't have a third component, but
if you try some other locales like 'zh' you'll see the CLDR major
version in third position.  So I guess some locales depend on CLDR
data and others don't.

TL;DR it *looks* like the set of ingredients for the version string is:

* UCOL_RUNTIME_VERSION (rarely changes)
* UCA/Unicode major.minor version
* sometimes CLDR major version, not sure when
* 9

[1] https://icu.unicode.org/design/collation/v2

Re: Collation version tracking for macOS

2022-11-28 Thread Jeff Davis

On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote:
> That is ... astonishingly bad.

https://unicode-org.atlassian.net/browse/CLDR-16175


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-28 Thread Robert Haas

On Mon, Nov 28, 2022 at 9:55 PM Jeff Davis  wrote:
> But did you notice that the version went backwards from 65.1 -> 66.1?
> Well, actually, it didn't. The version of that collation in 66.1 went
> from 153.97 -> 153.104. But there's a bug in versionToString() that
> does the decimal output incorrectly when there's a '0' digit between
> the hundreds and the ones place. I'll see about reporting that, but I
> thought I'd mention it here because it could have consequences, as we
> are storing the strings :-(
>
> The bug is still present in 70.1, but it's masked because it went to
> .112.
>
> Incidentally, this answers our other question about whether the
> collation version can change in a minor version update. Perhaps not,
> but if they fix this bug and backport it, then the version *string*
> will change in a minor update. Ugh.

That is ... astonishingly bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2022-11-28 Thread Jeff Davis

On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote:
> Here's the first iteration.

I will send a full review shortly, but I encountered an ICU bug along
the way, which caused me some confusion for a bit. I'll skip past the
various levels of confusion I had (burned a couple hours), and get
right to the repro:

Install the latest release of all major versions 50-69, and compile
postgres against 70. You'll get:

=# select * from pg_icu_collation_versions('en_US') order by
icu_version;
 icu_version | uca_version | collator_version 
-+-+--
 50.2| 6.2 | 58.0.6.50
 51.3| 6.2 | 58.0.6.50
 52.2| 6.2 | 58.0.6.50
 53.2| 6.3 | 137.51
 54.2| 7.0 | 137.56
 55.2| 7.0 | 153.56
 56.2| 8.0 | 153.64
 57.2| 8.0 | 153.64
 58.3| 9.0 | 153.72
 59.2| 9.0 | 153.72
 60.3| 10.0| 153.80
 61.2| 10.0| 153.80
 62.2| 11.0| 153.88
 63.2| 11.0| 153.88
 64.2| 12.1| 153.97
 65.1| 12.1| 153.97
 66.1| 13.0| 153.14
 67.1| 13.0| 153.14
 68.2| 13.0| 153.14
 69.1| 13.0| 153.14
 70.1| 14.0| 153.112
(21 rows)

This is good information, because it tells us that major library
versions change more often than collation versions, empirically-
speaking.

But did you notice that the version went backwards from 65.1 -> 66.1?
Well, actually, it didn't. The version of that collation in 66.1 went
from 153.97 -> 153.104. But there's a bug in versionToString() that
does the decimal output incorrectly when there's a '0' digit between
the hundreds and the ones place. I'll see about reporting that, but I
thought I'd mention it here because it could have consequences, as we
are storing the strings :-(

The bug is still present in 70.1, but it's masked because it went to
.112.

Incidentally, this answers our other question about whether the
collation version can change in a minor version update. Perhaps not,
but if they fix this bug and backport it, then the version *string*
will change in a minor update. Ugh.

Regards,
Jeff Davis

Re: Collation version tracking for macOS

2022-11-28 Thread Robert Haas

On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro  wrote:
> OK.  Time for a new list of the various models we've discussed so far:
>
> 1.  search-by-collversion:  We introduce no new "library version"
> concept to COLLATION and DATABASE object and little or no new syntax.
>
> 2.  lib-version-in-providers: We introduce a separate provider value
> for each ICU version, for example ICU63, plus an unversioned ICU like
> today.
>
> 3.  lib-version-in-attributes: We introduce daticuversion (alongside
> datcollversion) and collicuversion (alongside collversion).  Similar
> to the above, but it's a separate property and the provider is always
> ICU.  New syntax for CREATE/ALTER COLLATION/DATABASE to set and change
> ICU_VERSION.
>
> 4.  lib-version-in-locale:  "63:en" from earlier versions.  That was
> mostly a strawman proposal to avoid getting bogged down in
> syntax/catalogue/model change discussions while trying to prove that
> dlopen would even work.  It doesn't sound like anyone really likes
> this.
>
> 5.  lib-version-in-collversion:  We didn't explicitly discuss this
> before, but you hinted at it: we could just use u_getVersion() in
> [dat]collversion.

I'd like to vote against #3 at least in the form that's described
here. If we had three more libraries providing collations, it's likely
that they would need versioning, too. So if we add an explicit notion
of provider version, then it ought not to be specific to libicu.

I think it's OK to decide that different library versions are
different providers (your option #2), or that they are the same
provider but give rise to different collations (your option #4), or
that there can be multiple version of each collation which are
distinguished by some additional provider version field (your #3 made
more generic).

I don't really understand #1 or #5 well enough to have an educated
opinion, but I do think that #1 seems a bit magical. It hopes that the
combination of a collation name and a datcollversion will be
sufficient to find exactly one matcing collation in a list of provided
libraries. The advantage of that, as I understand it, is that if you
do something to your system that causes the number of matches to go
from one to zero, you can just throw another library on the pile and
get the number back up to one. Woohoo! But there's a part of me that
worries: what if the number goes up to two, and they're not all the
same? Probably that's something that shouldn't happen, but if it does
then I think there's kind of no way to fix it. With the other options,
if there's some way to jigger the catalog state to match what you want
to happen, you can always repair the situation somehow, because the
library to be used for each collation is explicitly specified in some
way, and you just have to get it to match what you want to have
happen.

I don't know too much about this, though, so I might have it all wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Collation version tracking for macOS

2022-11-27 Thread Thomas Munro

On Sat, Nov 26, 2022 at 6:27 PM Thomas Munro  wrote:
> This is just a first cut, but enough to try out and see if we like it,
> what needs to be improved, what edge cases we haven't thought about
> etc.  Let me know what you think.

BTW one problem to highlight (mentioned but buried in the test
comments), is that REFRESH VERSION doesn't affect other sessions or
even the current session.  You have to log out and back in again to
pick up the new version.  Obviously that's not good enough, but fixing
that involves making it transactional, I think.  If you abort, we have
to go back to using the old version, if you commit you keep the new
version and we might also consider telling other backends to start
using the new version -- or something like that.  I think that's just
a Small Matter of Programming, but a little bit finickity and I need
to take a break for a bit and go work on bugs elsewhere, hence v8
didn't address that yet.

Re: Collation version tracking for macOS

2022-11-25 Thread Thomas Munro

On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro  wrote:
> On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis  wrote:
> > I'd vote for 1 on the grounds that it's easier to document and
> > understand a single collation version, which comes straight from
> > ucol_getVersion(). This approach makes it a separate problem to find
> > the collation version among whatever libraries the admin can provide;
> > but adding some observability into the search should mitigate any
> > confusion.
>
> OK, it sounds like I should code that up next.

Here's the first iteration.  The version rosetta stone functions look like this:

postgres=# select * from pg_icu_library_versions();
 icu_version | unicode_version | cldr_version
-+-+--
 67.1| 13.0| 37.0
 63.1| 11.0| 34.0
 57.1| 8.0 | 29.0
(3 rows)

postgres=# select * from pg_icu_collation_versions('zh');
 icu_version | uca_version | collator_version
-+-+--
 67.1| 13.0| 153.14.37
 63.1| 11.0| 153.88.34
 57.1| 8.0 | 153.64.29
(3 rows)

It's no longer necessary to put anything in PG_TEST_EXTRA to run
"meson test irc/020_multiversion" usefully.  It will find extra ICU
versions all by itself in your system library search path and SKIP if
it doesn't find a second major version.  I have tried to cover the
main scenarios that I expect users to encounter in the update TAP
tests, with commentary that I hope will be helpful to assess the
usability of this thing.

Other changes:

* now using RTLD_LOCAL instead of RTLD_GLOBAL (I guess the latter
might cause trouble for someone using --disable-renaming, but I
haven't tested that and am not an expert on linker/loader arcana)
* fixed library names on Windows (based on reading the manual, but I
haven't tested that)
* fixed failure on non-ICU builds (the reason CI was failing in v7,
some misplaced #ifdefs)
* various cleanup
* I've attached a throwaway patch to install a second ICU version on
Debian/amd64 on CI, since otherwise the new test would SKIP on all
systems

This is just a first cut, but enough to try out and see if we like it,
what needs to be improved, what edge cases we haven't thought about
etc.  Let me know what you think.
From 0d96bfbec02245ddce6c985250ff0f8d38e41df9 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v8 1/2] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open version higher than the one were were compiled against.
The ABI might change in future releases so that wouldn't be safe.

Whenever creating a DATABASE or COLLATION object that uses ICU, we'll
use the "default" ICU library and record its ucol_getVersion() in the
catalog.  That's usually the one we're linked against but another can be
selected with the setting default_icu_version_library.

Whenever opening an existing DATABASE or COLLATION object that uses ICU,
we'll see the recorded [dat]collversion and try to find the ICU library
that provides that version.  If we can't, we'll fall back to using the
default ICU library with a warning that the user should either install
another ICU library version, or rebuild affected database objects and
REFRESH.

New GUCs:

icu_library_path

  A place to find ICU libraries, if not the default system library
  search path.

icu_library_versions

  A comma-separated list of ICU major or major.minor versions to make
  available to PostgreSQL, or * for every major version that can be
  found (the default).

default_icu_library_version

  The major or major.minor version to use for new objects and as a
  fallback (with warnings) if the right version can't be found.

Reviewed-by: Peter Eisentraut 
Reviewed-by: Jeff Davis 
Discussion: https://postgr.es/m/CA%2BhUKGL4VZRpP3CkjYQkv4RQ6pRYkPkSNgKSxFBwciECQ0mEuQ%40mail.gmail.com
---
 src/backend/access/hash/hashfunc.c|  16 +-
 src/backend/commands/collationcmds.c  |  20 +
 src/backend/utils/adt/formatting.c|  53 +-
 src/backend/utils/adt/pg_locale.c | 748 +-
 src/backend/utils/adt/varchar.c   |  16 +-
 src/backend/utils/adt/varlena.c   |  56 +-
 src/backend/utils/init/postinit.c |  34 +-
 src/backend/utils/misc/guc_tables.c   |  40 +-
 src/backend/utils/misc/postgresql.conf.sample |  10 +
 src/include/catalog/pg_proc.dat   |  23 +
 src/include/utils/pg_locale.h |  85 +-
 src/test/icu/meson.build  |   1 +
 src/test/icu/t/020_multiversion.pl| 274 +++
 src/tools/pgindent/typedefs.list  |   4 +
 14 files changed, 1275 insertions(+), 105 deletions(-)
 create mode 100644

Re: Collation version tracking for macOS

2022-11-23 Thread Thomas Munro

On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis  wrote:
> I'd vote for 1 on the grounds that it's easier to document and
> understand a single collation version, which comes straight from
> ucol_getVersion(). This approach makes it a separate problem to find
> the collation version among whatever libraries the admin can provide;
> but adding some observability into the search should mitigate any
> confusion.

OK, it sounds like I should code that up next.

> Can you go over the advantages of approaches 2-4 again? Is it just a
> concern about burdening the admin with finding the right ICU library
> version for a given collation version? That's a valid concern, but I
> don't think that should be an overriding design point. It seems more
> important to model the collation versions properly.

Yes, that's a good summary.  The user has a problem, and the solution
is to find some version of ICU and install it, so the problem space
necessarily involves the other kind of version.  My idea was that we
should therefore make that part of the model.  But the observability
support does indeed make it a bit clearer what's going on.

Re: Collation version tracking for macOS

2022-11-23 Thread Thomas Munro

On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis  wrote:
> I'm sure this has been discussed, but which distros even support
> multiple major versions of ICU?

For Debian and friends, you can install any number of libicuNN
packages (if you can find them eg from previous release repos), but
there's only one libicu-dev.  That means that one specific major
version is blessed by each Debian release and has its headers and
static libraries for you to use as a developer, but you can still
install the dynamic libraries from older releases at the same time to
satisfy the dependencies of packages or programs that were built on an
earlier OS release.  They don't declare conflicts on each other and
they contain non-conflicting filenames.  That's similar to the way
standard libraries and various other things are treated, for backward
compatibility.

For RHEL and friends, I'm pretty sure it's the same concept, but I
don't use those and haven't seen it with my own eyes.

I don't know for other Linux distros/families, but I expect the above
two cover a huge percentage of our users and I expect others to have
made similar choices.

For the BSDs, which tend to have a single binary package with both
headers and libraries owing to their origins as source-based
distributions (ports), the above way of thinking doesn't work; I
couldn't develop this on my usual FreeBSD battlestation without
building ICU myself (problem being that there's only one "pkg install
icu") and I hope to talk to someone who knows what to do about that
eventually.  I want this to work there easily for end users.

macOS and Windows have so many different ways of installing things
that there isn't a single answer there; supposedly open source is like
a bazaar and closed source like a cathedral, but as far as package
management goes, it looks more like rubble to me.

Re: Collation version tracking for macOS

2022-11-23 Thread Jeff Davis

On Wed, 2022-11-23 at 18:08 +1300, Thomas Munro wrote:

> (1) the default behaviour on failure to search would
> likely be to use the linked library instead and WARN about
> [dat]collversion mismatch, so far the same, and 

Agreed.

> (2) the set of people
> who would really be prepared to compile their own copy of 67.X
> instead
> of downgrading or REFRESHing (with or without rebuilding) is
> vanishingly small.

The set of people prepared to do so is probably small. But the set of
people who will do it (prepared or not) when a problem comes up is
significantly larger ;-)

> 1.  *Do* they change ucol_getVersion() values in minor releases?  I
> tried to find a written policy on that.

It seems like a valid concern. The mere existence of a collation
version separate from the library major version seems to suggest that
it's possible. Perhaps they avoid it in most cases; but absent a
specific policy against it, the separate collation version seems to
allow them the freedom to do so.

> This speculation feels pretty useless.  Maybe we should go and read
> the code or ask an ICU expert, but I'm not against making it
> theoretically possible to access two different minor versions at
> once,
> just to cover all the bases for future-proofing.

I don't think this should be an overriding concern that drives the
whole design. It is a nudge in favor of search-by-collversion.

> 2.  Would package managers ever allow two minor versions to be
> installed at once?  I highly doubt it; 

Agreed.

I'm sure this has been discussed, but which distros even support
multiple major versions of ICU?

> 
> 1.  search-by-collversion:  We introduce no new "library version"
> concept to COLLATION and DATABASE object and little or no new syntax.
> Whenever opening a collation or database, the system will search some
> candidate list of ICU libraries to try to find the one that agrees
> with [dat]collversion.

[...]

> The reason I prefer major[.minor] strings over whole library names is
> that we need to dlopen two of them so it's a little easier to build
> them from those parts than have to supply both names.

It also makes it easier to know which version suffixes to look for.

>   The reason I
> prefer to keep allowing major-only versions to be listed is that it's
> good to have the option to just follow minor upgrades automatically.

Makes sense.

> Or I guess you could make something that can automatically search a
> whole directory (which directory?) to find all the suitably named
> libraries so you don't ever have to mention versions manually (if you
> want "apt-get install libicu72" to be enough with no GUC change
> needed) -- is that too weird?

That seems to go a little too far.

>   SELECT * FROM pg_available_icu_libraries()
>   SELECT * FROM pg_available_icu_collation_versions('en')

+1

> 2.  lib-version-in-providers: We introduce a separate provider value
> for each ICU version, for example ICU63, plus an unversioned ICU like
> today.

I expressed interest in this approach before, but when you allowed ICU
compiled with --disable-renaming, that mitigated my concerns about when
to throw that error.

> 3.  lib-version-in-attributes: We introduce daticuversion (alongside
> datcollversion) and collicuversion (alongside collversion).

I think this is the best among 2-4.

> 4.  lib-version-in-locale:  "63:en" from earlier versions.  That was
> mostly a strawman proposal to avoid getting bogged down in
> syntax/catalogue/model change discussions while trying to prove that
> dlopen would even work.  It doesn't sound like anyone really likes
> this.

I don't see any advantage of this over 3.

> 5.  lib-version-in-collversion:  We didn't explicitly discuss this
> before, but you hinted at it: we could just use u_getVersion() in
> [dat]collversion.

The advantage here is that it's very easy to tell the admin what
library the collation is looking for, but the disadvantages you point
out seem a lot worse: migration problems from v15, and the minor
version question.

I'd vote for 1 on the grounds that it's easier to document and
understand a single collation version, which comes straight from
ucol_getVersion(). This approach makes it a separate problem to find
the collation version among whatever libraries the admin can provide;
but adding some observability into the search should mitigate any
confusion.

Can you go over the advantages of approaches 2-4 again? Is it just a
concern about burdening the admin with finding the right ICU library
version for a given collation version? That's a valid concern, but I
don't think that should be an overriding design point. It seems more
important to model the collation versions properly.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-22 Thread Thomas Munro

On Tue, Nov 22, 2022 at 7:34 PM Jeff Davis  wrote:
> On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> > Problem 2:  If ICU 67 ever decides to report a different version for
> > a
> > given collation (would it ever do that?  I don't expect so, but ...),
> > we'd be unable to open the collation with the search-by-collversion
> > design, and potentially the database.  What is a user supposed to do
> > then?  Presumably our error/hint for that would be "please insert the
> > correct ICU library into drive A", but now there is no correct
> > library
>
> Let's say that Postgres is compiled against version 67.X, and the
> sysadmin upgrades the ICU package to 67.Y, which reports a different
> collation version for some locale.
>
> Your current patch makes this impossible for the administrator to fix,
> because there's no way to have two different libraries loaded with the
> same major version number, so it will always pick the compiled-in ICU.
> The user will be forced to accept the new version of the collation, see
> WARNINGs in their logs, and possibly corrupt their indexes.

They could probably also 'pin' the older minor version package using
their package manager (= downgrade) until they're ready to upgrade and
use REFRESH VERSION to certify that they've rebuilt everything
relevant or are OK with risks.  Not pretty I admit, but I think the
end result is about the same for search-for-collversion, because I
imagine that (1) the default behaviour on failure to search would
likely be to use the linked library instead and WARN about
[dat]collversion mismatch, so far the same, and (2) the set of people
who would really be prepared to compile their own copy of 67.X instead
of downgrading or REFRESHing (with or without rebuilding) is
vanishingly small.

Two questions I wondered about:

1.  *Do* they change ucol_getVersion() values in minor releases?  I
tried to find a written policy on that.
https://icu.unicode.org/processes is not encouraging: it gives the
example of a "third digit in an official release number" [changing]
because a CLDR change was incorporated.  Hrmph.  But that's clearly
not even the modern ICU versioning system (it made a change a bit like
ours in 49, making the first number only major, so maybe that "third"
number is now the second number, AKA minor version), and also that's a
CLDR minor version change; is CLDR minor even in the recipe for
ucol_getVersion()?  Even without data changes, I guess that bug fixes
could apply to the UCA logic, and I assume that UCA logic is included
in it.  Hmm.

A non-hypothetical example of a CLDR change within an ICU major
version that I've been able to find is:

https://cldr.unicode.org/index/downloads/cldr-38

Here we see that CLDR had a minor version bump 38 -> 38.1, "a very
small number of incremental additions to version 38 to address the
specific bugs listed in Δ38.1", and was included in ICU 68.2.  Being a
minor ICU release 68.1 -> 68.2, perhaps you could finish up running
that just with a regular upgrade on typical distros (not a major OS
upgrade), and since PostgreSQL would normally be linked against eg
.68, not .68.1, it'd start using it at the next cluster start when
that symlink is updated to point to .68.2.  As it happens, if you
follow the documentation links to see what actually changed in that
particular pair of CLDR+ICU minor releases, it's timezones and locale
stuff other than collations, so wouldn't affect us.  Can we find a
chapter and verse that says that ICU would only ever move to a new
CLDR in a minor release, and CLDR would never change order of
pre-existing code points in a minor release?

It might be interesting to see if
https://github.com/unicode-org/icu/tree/release-68-1 and
https://github.com/unicode-org/icu/tree/release-68-2 report a
different ucol_getVersion() for any locale, but not conclusive if it
doesn't; it might be because something in the version pipeline knew
that particular CLDR change didn't affect collators...

This speculation feels pretty useless.  Maybe we should go and read
the code or ask an ICU expert, but I'm not against making it
theoretically possible to access two different minor versions at once,
just to cover all the bases for future-proofing.

2.  Would package managers ever allow two minor versions to be
installed at once?  I highly doubt it; they're probably more
interested in ABI stability so that dependent packages work when
bugfixes are shipped, and that's certainly nailed down at the major
version level.  It'd probably be a case of having to compile it
yourself, which seems unlikely to me in the real world.  That's why I
left minor version out of earlier patches, but I'm OK with changing
that.

As for how, I think that depends on our modelling decision (see below).

> Search-by-collversion would still be frustrating for the admin, but at
> least it would be possible to fix by compiling their own 67.X and
> asking Postgres to search that library, too. We could make it slightly
> more friendly by

Re: Collation version tracking for macOS

2022-11-21 Thread Jeff Davis

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> Problem 2:  If ICU 67 ever decides to report a different version for
> a
> given collation (would it ever do that?  I don't expect so, but ...),
> we'd be unable to open the collation with the search-by-collversion
> design, and potentially the database.  What is a user supposed to do
> then?  Presumably our error/hint for that would be "please insert the
> correct ICU library into drive A", but now there is no correct
> library

Let's say that Postgres is compiled against version 67.X, and the
sysadmin upgrades the ICU package to 67.Y, which reports a different
collation version for some locale.

Your current patch makes this impossible for the administrator to fix,
because there's no way to have two different libraries loaded with the
same major version number, so it will always pick the compiled-in ICU.
The user will be forced to accept the new version of the collation, see
WARNINGs in their logs, and possibly corrupt their indexes.

Search-by-collversion would still be frustrating for the admin, but at
least it would be possible to fix by compiling their own 67.X and
asking Postgres to search that library, too. We could make it slightly
more friendly by having an error that reports the libraries searched
and the collation versions found, if none of the versions match. We can
have a GUC that controls whether a failure to find the right version is
a WARNING or an ERROR.

On Sat, 2022-11-19 at 07:38 +1300, Thomas Munro wrote:
> >   * We'll need some clearer instructions on how to build/install
> > extra
> > ICU versions that might not be provided by the distribution
> > packaging.
> > For instance, I got a cryptic error until I used --enable-rpath,
> > which
> > might not be obvious to all users.
> 
> Suggestions welcome.  No docs at all yet...

I tried to write up some docs. It's hard to explain why we are exposing
to the user the collation version and the library version in these
different ways, and what effects they have.

The current patch feels like it hasn't decided whether the collation
version is ucol_getVersion() (collversion) or u_getVersion() (library
version). The collversion is more prominent in the UI (with its own
syntax), yet it's just a cross-check for whether to issue a WARNING or
not; while the library version is hidden in the locale field and it
actually decides which symbol is called.

> 
> 
> Yeah.  I just don't like the way it *appears* to be doing something
> clever, but
> it doesn't solve any fundamental problem at all because the
> collversion
> information is under human control and so it's really doing something
> stupid.

I assume by "human control" you mean "ALTER COLLATION ... REFRESH
VERSION". I agree that relying on the admin's declaration is dubious,
especially when we provide no good advice on how to actually do that
safely.

But I don't see what using the library version instead buys us here,
except that library version is part of the LOCALE, and there's no ALTER
command for that. You could just as easily deprecate/eliminate the
ALTER COLLATION REFRESH VERSION, and then say that the collversion is
out of human control, too.

By introducing multiple libraries, I think we need to change that
syntax anyway, to be something like:

   ALTER COLLATION ... SET VERSION TO '...'

or even:

   ALTER COLLATION ... FORCE VERSION TO '...'

> Hence desire to build something that at least admits that it's
> primitive and
> just gives you some controls, in a first version.

Using either the library version or the collation version seems
reasonably simple to me. But from a documentation and usability
standpoint, the way they are currently mixed seems confusing.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-18 Thread Thomas Munro

On Sat, Nov 19, 2022 at 7:38 AM Thomas Munro  wrote:
> On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis  wrote:
> > I realize your patch is experimental, but when there is a better
> > consensus on the approach, we should consider adding declarative syntax
> > such as:
> >
> >CREATE COLLATION (or LOCALE?) PROVIDER icu67
> >  TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';
> >
> > It will offer more opportunities to catch errors early and offer better
> > error messages. It would also enable it to function if the library is
> > built with --disable-renaming (though we'd have to trust the user).
>
> Earlier in this and other threads, we wondered if each ICU major version 
> should
> be a separate provider, which is what you're showing there, or should be an
> independent property of an individual COLLATION, which is what v6 did with
> '63:en' and what Peter suggested I make more formal with CREATE COLLATION foo
> (..., ICU_VERSION=63).  I actually started out thinking we'd have multiple
> providers, but I couldn't really think of any advantage, and I think it makes
> some upgrade scenarios more painful.  Can you elaborate on why you'd want
> that model?

Hmm, thinking some more about this... I said the above thinking that
you couldn't change a provider after creating a database/collation.
But what if you could?

1.  CREATE DATABASE x LOCALE_PROVIDER=icu ...;
2.  Some time later after an upgrade, my postgres binary is linked
against a new ICU version and I start seeing warnings.
3.  ALTER DATABASE x LOCALE_PROVIDER=icu63;

I suppose you shouldn't be allowed to change libc -> icu, but you
could change icu - > icuXXX, or I guess icuXXX -> icuXXX.

What if you didn't have to manually manage the set of available
providers with DDL like you showed, but we just automatically
supported "icu" (= the linked ICU, whatever it might be), and icu50 up
to icuXXX where XXX is the linked ICU's version?  We can encode those
values + libc as an int, to replace the existing char the represents
providers in catalogues.

That's basically just a different way of encoding the same information
that Peter was suggesting I put in a new catalogue attribute.  How do
you like that bikeshed colour?

Re: Collation version tracking for macOS

2022-11-18 Thread Thomas Munro

Replying to Peter and Jeff in one email.

On Sat, Nov 12, 2022 at 3:57 AM Peter Eisentraut
 wrote:
> On 22.10.22 03:22, Thomas Munro wrote:
> > I'd love to hear others' thoughts on how we can turn this into a
> > workable solution.  Hopefully while staying simple...
>
> I played with this patch a bit.  It looks like a reasonable approach.

Great news.

> Attached is a small patch to get the dynamic libicu* lookup working with
> the library naming on macOS.

Thanks, squashed.

> Instead of packing the ICU version into the locale field ('63:en'), I
> would make it a separate field in pg_collation and a separate argument
> in CREATE COLLATION.

I haven't tried this yet, as I focused on coming up with a way of testing in
this iteration.  I can try this next.  I'm imagining that we'd have
pg_collation.collicuversion and pg_database.daticuversion, and they'd default
to 0 for "use the GUC", and perhaps you'd even be able to ALTER them.  Perhaps
we wouldn't even need the GUC then...  0 could mean "the linked version", and
if you don't like it, you ALTER it.  Thinking about this.

> At this point, perhaps it would be good to start building some tests to
> demonstrate various upgrade scenarios and to ensure portability.

OK, here's what I came up with.  You enable it in PG_TEST_EXTRA, and
tell it about an alternative ICU version you have in the standard library
search path that is not the same as the main/linked one:

$ meson configure -DPG_TEST_EXTRA="icu=63"
$ meson test icu/020_multiversion

Another change from your feedback:  you mentioned that RHEL7 shipped with ICU
50, so I removed my suggestion of dropping some extra code we carry for
versions before 54 and set the minimum acceptable version to 50.  It probably
works further back than that, but that's a decent range, I think.

On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis  wrote:
> I looked at v6.

Thanks for jumping in and testing!

>   * We'll need some clearer instructions on how to build/install extra
> ICU versions that might not be provided by the distribution packaging.
> For instance, I got a cryptic error until I used --enable-rpath, which
> might not be obvious to all users.

Suggestions welcome.  No docs at all yet...

>   * Can we have a better error when the library was built with --
> disable-renaming? We can just search for the plain (no suffix) symbol.

I threw out that symbol probing logic, and wrote something simpler that should
now also work with --disable-renaming (though not tested).  Now it does a
cross-check with the library's self-reported major version, just to make
sure there wasn't a badly named library file, which may be more likely
with --disable-renaming.

>   * We should use dlerror() instead of %m to report dlopen() errors.

Fixed.

>   * It seems like the collation version is just there to issue WARNINGs
> when a user is using the non-versioned locale syntax and the library
> changes underneath them (or if there is collation version change within
> a single ICU major version)?

Correct.

I have now updated the warning messages you get when they don't match, to
provide a hint about what to do about it.  I am sure they need some more
word-smithing, though.

>   * How are you testing this?

Ad hoc noodling before now, but see attached.

> I realize your patch is experimental, but when there is a better
> consensus on the approach, we should consider adding declarative syntax
> such as:
>
>CREATE COLLATION (or LOCALE?) PROVIDER icu67
>  TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';
>
> It will offer more opportunities to catch errors early and offer better
> error messages. It would also enable it to function if the library is
> built with --disable-renaming (though we'd have to trust the user).

Earlier in this and other threads, we wondered if each ICU major version should
be a separate provider, which is what you're showing there, or should be an
independent property of an individual COLLATION, which is what v6 did with
'63:en' and what Peter suggested I make more formal with CREATE COLLATION foo
(..., ICU_VERSION=63).  I actually started out thinking we'd have multiple
providers, but I couldn't really think of any advantage, and I think it makes
some upgrade scenarios more painful.  Can you elaborate on why you'd want
that model?

> On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> > Problem 1:  Suppose you're ready to start using (say) v72.  I guess
> > you'd use the REFRESH command, which would open the main linked ICU's
> > collversion and stamp that into the catalogue, at which point new
> > sessions would start using that, and then you'd have to rebuild all
> > your indexes (with no help from PG to tell you how to find everything
> > that needs to be rebuilt, as belaboured in previous reverted work).
> > Aside from the possibility of getting the rebuilding job wrong (as
> > belaboured elsewhere), it's not great, because there is still a
> > transitional period where you can be using the wrong version for

Re: Collation version tracking for macOS

2022-11-14 Thread Jeff Davis

I looked at v6.

  * We'll need some clearer instructions on how to build/install extra
ICU versions that might not be provided by the distribution packaging.
For instance, I got a cryptic error until I used --enable-rpath, which
might not be obvious to all users.
  * Can we have a better error when the library was built with --
disable-renaming? We can just search for the plain (no suffix) symbol.
  * We should use dlerror() instead of %m to report dlopen() errors.
  * It seems like the collation version is just there to issue WARNINGs
when a user is using the non-versioned locale syntax and the library
changes underneath them (or if there is collation version change within
a single ICU major version)?
  * How are you testing this?
  * In my tests (sort, hacked so abbreviate is always false), I see a
~3% regression for ICU+UTF8. That's fine with me. I assume it's due to
the indirect function call, but that's not obvious to me from the
profile. If it's a major problem we could have a special case of
varstrfastcmp_locale() that works on the compile-time ICU version.

I realize your patch is experimental, but when there is a better
consensus on the approach, we should consider adding declarative syntax
such as:

   CREATE COLLATION (or LOCALE?) PROVIDER icu67
 TYPE icu VERSION '67' AS '/path/to/icui18n.so.67';

It will offer more opportunities to catch errors early and offer better
error messages. It would also enable it to function if the library is
built with --disable-renaming (though we'd have to trust the user).

On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote:
> Problem 1:  Suppose you're ready to start using (say) v72.  I guess
> you'd use the REFRESH command, which would open the main linked ICU's
> collversion and stamp that into the catalogue, at which point new
> sessions would start using that, and then you'd have to rebuild all
> your indexes (with no help from PG to tell you how to find everything
> that needs to be rebuilt, as belaboured in previous reverted work).
> Aside from the possibility of getting the rebuilding job wrong (as
> belaboured elsewhere), it's not great, because there is still a
> transitional period where you can be using the wrong version for your
> data.  So this requires some careful planning and understanding from
> the administrator.

How is this related to the search-by-collversion design? It seems like
it's hard no matter what.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: Collation version tracking for macOS

2022-11-11 Thread Peter Eisentraut


On 22.10.22 03:22, Thomas Munro wrote:

I'd love to hear others' thoughts on how we can turn this into a
workable solution.  Hopefully while staying simple...


I played with this patch a bit.  It looks like a reasonable approach.

Attached is a small patch to get the dynamic libicu* lookup working with 
the library naming on macOS.


Instead of packing the ICU version into the locale field ('63:en'), I 
would make it a separate field in pg_collation and a separate argument 
in CREATE COLLATION.


At this point, perhaps it would be good to start building some tests to 
demonstrate various upgrade scenarios and to ensure portability.


From e236f5257bf0bf3e7b83b9d9b095d1d0e3fdc971 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 11 Nov 2022 15:44:44 +0100
Subject: [PATCH] fixup! WIP: Multi-version ICU.

---
 src/backend/utils/adt/pg_locale.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/backend/utils/adt/pg_locale.c 
b/src/backend/utils/adt/pg_locale.c
index 666a79b907a4..3ffb9706ff99 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1600,6 +1600,10 @@ get_icu_library(int major_version)
 "%s%sicui18n%d." DLSUFFIX,
 icu_library_path,
 icu_library_path[0] ? "\\" : "",
+#elif defined(__darwin__)
+"%s%slibicui18n.%d" DLSUFFIX,
+icu_library_path,
+icu_library_path[0] ? "/" : "",
 #else
 "%s%slibicui18n" DLSUFFIX ".%d",
 icu_library_path,
@@ -1615,6 +1619,10 @@ get_icu_library(int major_version)
 "%s%sicuuc%d." DLSUFFIX,
 icu_library_path,
 icu_library_path[0] ? "\\" : "",
+#elif defined(__darwin__)
+"%s%slibicuuc.%d" DLSUFFIX,
+icu_library_path,
+icu_library_path[0] ? "/" : "",
 #else
 "%s%slibicuuc" DLSUFFIX ".%d",
 icu_library_path,
-- 
2.38.1

Re: Collation version tracking for macOS

2022-11-08 Thread Thomas Munro

On Tue, Nov 8, 2022 at 1:22 AM Peter Eisentraut
 wrote:
> I made a Homebrew repository for ICU versions 50 through 72:
> https://github.com/petere/homebrew-icu

Nice!

> All of these packages build and pass their self-tests on my machine.  So
> from that experience, I think maintaining a repository of ICU versions,
> and being able to install more than one for testing this feature, is
> feasible.

I wonder what the situation with CVEs is in older releases.  I heard a
rumour that upstream might only patch current + previous, leaving it
up to distros to back-patch to whatever they need to support, but I
haven't tried to track down cold hard evidence of this or think about
what it means for this project...

Re: Collation version tracking for macOS

2022-11-07 Thread Peter Eisentraut


On 02.11.22 00:57, Thomas Munro wrote:

3.  Library availability.  This is a problem for downstream
communities to solve.  For example, the people who build Windows
installers might want to start bundling the ICU versions from their
earlier releases, the people involved with each Linux/BSD distro would
hopefully figure out a good way to publish the packages from older OS
releases in one repo, and the people running managed systems probably
do their own packaging anyway, they'll figure it out.  I realise that
you are involved in packaging and I am not, so we probably have
different perspectives: I get to say "and here, magic happens!" :-)


I made a Homebrew repository for ICU versions 50 through 72: 
https://github.com/petere/homebrew-icu


All of these packages build and pass their self-tests on my machine.  So 
from that experience, I think maintaining a repository of ICU versions, 
and being able to install more than one for testing this feature, is 
feasible.


Now I have started building PostgreSQL against these, to get some 
baseline of what is supported and actually works.  The results are a bit 
mixed so far, more to come later.


The installation instructions currently say that the minimum required 
version of ICU is 4.2.  That was the one that shipped with RHEL 6.  I 
think we have de-supported RHEL 6 and could increase that.  The version 
in RHEL 7 is 50.


(My repository happens to start at 50 because the new versioning system 
started at 49, but 49 doesn't appear to be tagged at the icu github site.)


Note: Recent versions of libxml2 link against icu.  This isn't a 
problem, thanks to the symbol versioning, but if you get libxml2 via 
pkg-config, you might get LDFLAGS from not the icu version you wanted.

Re: Collation version tracking for macOS

2022-11-01 Thread Thomas Munro

On Wed, Nov 2, 2022 at 1:42 AM Thomas Munro  wrote:
> On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut
>  wrote:
> > What I'm wondering is where those ICU installations are going to come
> > from.  In order for this project to be viable, we would need to convince
> > some combination of ICU maintainers, OS packagers, and PGDG packagers to
> > provide and maintain five year's worth of ICU packages (yearly releases
> > AFAICT).  Is that something we are willing to get into?
>
> I hacked on this on a Debian machine that has a couple of these
> installed and they work fine, but now I realise that might have to do
> with the major upgrade history of the machine.  So yeah... probably.
> :-/  Not being involved in packaging I have no idea how plausible such
> a backports (erm, forwardports?) repo would be, and I have even less
> idea for other distros.

After sleeping on it, I don't really agree that the project is not
viable even if it requires hoop-jumping to set up right now.  It's a
chicken-and-egg problem, and the first step is to make it possible to
do it at all, thereby creating the demand for convenient packages.  I
think we have several topics here:

1.  Technical problems relating to dlopen'ing.  Does it work?  Is the
default dlopen() secure enough?  Is it building sensible library
names, even on the freaky-library OSes (Windows, macOS, AIX)?  Is it
enough to have that GUC for non-default path, should it be a search
path, should it share the existing dynamic_library_path?  Are the
indirect function calls fast enough?  Is the way it handles API
stability sound?  Can we drop some unfinished complexity by dropping
pre-53 ICU?  Does it use too much memory?
2.  User experience problems relating to upgrade paths and user
interface.  Is it enough to start with the basic DB2-style approach
that I've prototyped here?  How should we refer to library versions?
Is your search-for-the-collversion idea better?  My gut feeling is
that the early version should be about giving people options, and not
trying to be too clever/automatic with questionable semantics, and
later improvements could follow, for example if we have another go at
the per-object version tracking.
3.  Library availability.  This is a problem for downstream
communities to solve.  For example, the people who build Windows
installers might want to start bundling the ICU versions from their
earlier releases, the people involved with each Linux/BSD distro would
hopefully figure out a good way to publish the packages from older OS
releases in one repo, and the people running managed systems probably
do their own packaging anyway, they'll figure it out.  I realise that
you are involved in packaging and I am not, so we probably have
different perspectives: I get to say "and here, magic happens!" :-)

FWIW at least 57, 63 and 67 (corresponding to deb9, 10, 11) from
http://ftp.debian.org/debian/pool/main/i/icu/ can be installed with
dpkg -i on my Debian 11 machine.  52 (deb8) too, probably, but it has
dependencies I didn't look into.  71 and 72 are newer than the -dev
version (what we link against), so I didn't try installing but the
patch as posted wouldn't let me open them: the idea here is to allow
only older stuff to be dlopen'd, so if a breaking API change comes
down the pipe we'll be able to deal with it.  Not being a packaging
guy, I don't how how stupid it would be to build a package repo that
literally just exposes these via an index and that's all, or whether
it's better to rebuild the ICU versions from source against modern
C/C++ runtimes etc.

Re: Collation version tracking for macOS

2022-11-01 Thread Thomas Munro

On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut
 wrote:
> What I'm wondering is where those ICU installations are going to come
> from.  In order for this project to be viable, we would need to convince
> some combination of ICU maintainers, OS packagers, and PGDG packagers to
> provide and maintain five year's worth of ICU packages (yearly releases
> AFAICT).  Is that something we are willing to get into?

I hacked on this on a Debian machine that has a couple of these
installed and they work fine, but now I realise that might have to do
with the major upgrade history of the machine.  So yeah... probably.
:-/  Not being involved in packaging I have no idea how plausible such
a backports (erm, forwardports?) repo would be, and I have even less
idea for other distros.

Re: Collation version tracking for macOS

2022-11-01 Thread Peter Eisentraut


On 22.10.22 03:22, Thomas Munro wrote:

Suppose your pgdata encounters a PostgreSQL linked against a later ICU
library, most likely after an OS upgrade or migratoin, a pg_upgrade,
or via streaming replication.  You might get a new error "can't find
ICU collation 'en' with version '153.14'; HINT: install missing ICU
library version", and somehow you'll have to work out which one might
contain 'en' v153.14 and install it with apt-get etc.  Then it'll
magically work: your postgres linked against (say) 71 will happily
work with the dlopen'd 67.  This is enough if you want to stay on 67
until the heat death of the universe.  So far so good.


What I'm wondering is where those ICU installations are going to come 
from.  In order for this project to be viable, we would need to convince 
some combination of ICU maintainers, OS packagers, and PGDG packagers to 
provide and maintain five year's worth of ICU packages (yearly releases 
AFAICT).  Is that something we are willing to get into?


(Even to test this I need to figure out where to get another ICU 
installation from.  I'll try how easy manual installations are.)

Re: Collation version tracking for macOS

2022-10-21 Thread Thomas Munro

On Sat, Oct 22, 2022 at 10:24 AM Thomas Munro  wrote:
> ... But it
> doesn't provide a way for me to create a new database that uses 63 on
> purpose when I know what I'm doing.  There are various reasons I might
> want to do that.

Thinking some more about this, I guess that could be addressed by
having an explicit way to request either the library version or
collversion-style version when creating a database or collation, but
not actually storing it in daticulocale/colliculocale.  That could be
done either as part of the string that is trimmed off before storing
it (so it's only used briefly during creation to find a non-default
library)... Perhaps that'd look like initdb --icu-locale "67:en" (ICU
library version) or "154.14:en" (individual collation version) or some
new syntax in a few places.  Thereafter, it would always be looked up
by searching for the right library by [dat]collversion as Peter E
suggested.

Let me try harder to vocalise some more thoughts that have stopped me
from trying to code the search-by-collversion design so far:

Suppose your pgdata encounters a PostgreSQL linked against a later ICU
library, most likely after an OS upgrade or migratoin, a pg_upgrade,
or via streaming replication.  You might get a new error "can't find
ICU collation 'en' with version '153.14'; HINT: install missing ICU
library version", and somehow you'll have to work out which one might
contain 'en' v153.14 and install it with apt-get etc.  Then it'll
magically work: your postgres linked against (say) 71 will happily
work with the dlopen'd 67.  This is enough if you want to stay on 67
until the heat death of the universe.  So far so good.

Problem 1:  Suppose you're ready to start using (say) v72.  I guess
you'd use the REFRESH command, which would open the main linked ICU's
collversion and stamp that into the catalogue, at which point new
sessions would start using that, and then you'd have to rebuild all
your indexes (with no help from PG to tell you how to find everything
that needs to be rebuilt, as belaboured in previous reverted work).
Aside from the possibility of getting the rebuilding job wrong (as
belaboured elsewhere), it's not great, because there is still a
transitional period where you can be using the wrong version for your
data.  So this requires some careful planning and understanding from
the administrator.

I admit that the upgrade story is a tiny bit better than the v5
DB2-style patch, which starts using the new version immediately if you
didn't use a prefix (and logs the usual warnings about collversion
mismatch) instead of waiting for you to run REFRESH.  But both of them
have a phase where they might use the wrong library to access an
index.  That's dissatisfying, and leads me to prefer the simple
DB2-style solution that at least admits up front that it's not very
clever.  The DB2-style patch could be improved a bit here with the
addition of one more GUC: default_icu_library, so the administrator,
rather than the packager, remains in control of which version we use
for non-prefixed iculocale values (likely to be what almost everyone
is interested in), defaulting to what the packager linked against.
I've added that to the patch for illustration (though obviously the
error messages produced by collversion mismatch could use some
adjustment, ie to clarify that the warning might be cleared by
installing and selecting a different library version).

Problem 2:  If ICU 67 ever decides to report a different version for a
given collation (would it ever do that?  I don't expect so, but ...),
we'd be unable to open the collation with the search-by-collversion
design, and potentially the database.  What is a user supposed to do
then?  Presumably our error/hint for that would be "please insert the
correct ICU library into drive A", but now there is no correct
library; if you can even diagnose what's happened, I guess you might
downgrade the ICU library using package tools or whatever if possible,
but otherwise you'd be stuck, if you just can't get the right library.
Is this a problem?  Would you want to be able to say "I don't care,
computer, please just press on"?  So I think we need a way to turn off
the search-by-collversion thing.  How should it look?

I'd love to hear others' thoughts on how we can turn this into a
workable solution.  Hopefully while staying simple...
From 0355984c9a80ff15bfac51677fea30b9be68226b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v6] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open version higher than the one were were compiled against.
The ABI might change in future releases so that wouldn't be safe.

By default, the system linker's default search path is used to find
libraries, but icu_library_path may be used to

Re: Collation version tracking for macOS

2022-10-21 Thread Thomas Munro

Hi,

Here is a rebase of this experimental patch.  I think the basic
mechanics are promising, but we haven't agreed on a UX.  I hope we can
figure this out.

Restating the choice made in this branch of the experiment:  Here I
try to be just like DB2 (if I understood its manual correctly).
In DB2, you can use names like "en_US" if you don't care about
changes, and names like "CLDR181_en_US" if you do.  It's the user's
choice to use the second kind to avoid "unexpected effects on
applications or database objects" after upgrades.  Translated to
PostgreSQL concepts, you can use a database default ICU locale like
"en-US" if you don't care and "67:en-US" if you do, and for COLLATION
objects it's the same.  The convention I tried in this patch is that
you use either "en-US-x-icu" (which points to "en-US") or
"en-US-x-icu67" (which points to "67:en-US") depending on whether you
care about this problem.

I recognise that this is a bit cheesy, it's all the user's problem to
deal with or ignore.

An alternative mentioned by Peter E was that the locale names
shouldn't carry the prefix, but somehow we should have a list of ICU
versions to search for a matching datcollversion/collversion.  How
would that look?  Perhaps a GUC, icu_library_versions = '63, 67, 71'?
There is a currently natural and smallish range of supported versions,
probably something like 54 ... U_ICU_VERSION_MAJOR_NUM, but it seems a
bit weird to try to dlopen ~25 libraries or whatever it might be...
Do you think we should try to code this up?

I haven't tried it, but the main usability problem I predict with that
idea is this:  It can cope with a scenario where you created a
database with ICU 63 and started using a default of "en" and maybe
some explicit fr-x-icu or whatever, and then you upgrade to a new
postgres binary using ICU 71, and, as long as you still have ICU 63
installed it'll just magicaly keep using 63, now via dlopen().  But it
doesn't provide a way for me to create a new database that uses 63 on
purpose when I know what I'm doing.  There are various reasons I might
want to do that.

Maybe the ideas could be combined?  Perhaps "en" means "create using
binary's linked ICU, open using search-by-collversion", while "67:en"
explicitly says which to use?

Changes since last version:

 * Now it just uses the default dlopen() search path, unless you set
icu_library_path.  Is that a security problem?  It's pretty
convenient, because it means you can just "apt-get install libicu63"
(or local equivalent) and that's all, now 63 is available.

 * To try the idea out, I made it automatically create "*-x-icu67"
alongside the regular "-x-icu" collation objects at initdb time.
From d3e83d0aa5cbb3eb192a2f66d68623cd3b1595b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v5] WIP: Multi-version ICU.

Add a layer of indirection when accessing ICU, so that multiple major
versions of the library can be used at once.  Versions other than the
one that PostgreSQL was linked against are opened with dlopen(), but we
refuse to open version higher than the one were were compiled against.
The ABI might change in future releases so that wouldn't be safe.

By default, the system linker's default search path is used to find
libraries, but icu_library_path may be used to specify an absolute path
to look in.  ICU libraries are expected to have been built without ICU's
--disable-renaming option.  That is, major versions must use distinct
symbol names.

This arrangement means that at least one major version of ICU is always
available -- the one that PostgreSQL was linked again.  It should be
simple on most software distributions to install extra versions using a
package manager, or to build extra libraries as required, to access
older ICU releases.  For example, on Debian bullseye the packages are
named libicu63, libicu67, libicu71.

In this version of the patch, '63:en' used as a database default locale
or COLLATION object requests ICU library 63, and 'en' requests the
library that is linked against the postgres executable.

XXX Many other designs possible, to discuss!

Discussion: https://postgr.es/m/CA%2BhUKGL4VZRpP3CkjYQkv4RQ6pRYkPkSNgKSxFBwciECQ0mEuQ%40mail.gmail.com
---
 src/backend/access/hash/hashfunc.c   |  16 +-
 src/backend/commands/collationcmds.c |  20 ++
 src/backend/utils/adt/formatting.c   |  53 +++-
 src/backend/utils/adt/pg_locale.c| 364 ++-
 src/backend/utils/adt/varchar.c  |  16 +-
 src/backend/utils/adt/varlena.c  |  56 +++--
 src/backend/utils/misc/guc_tables.c  |  14 ++
 src/include/utils/pg_locale.h|  73 ++
 src/tools/pgindent/typedefs.list |   3 +
 9 files changed, 549 insertions(+), 66 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)

Re: Collation version tracking for macOS

2022-06-14 Thread Peter Eisentraut


On 14.06.22 21:10, Jeremy Schneider wrote:

Does Unicode CDLR provide (or even track) versioning of collation or other i18n 
functionality for individual locale settings?


Yes.  You can see that in PostgreSQL as various pre-seeded ICU 
collations having different versions.

Re: Collation version tracking for macOS

2022-06-14 Thread Jeremy Schneider

> On Jun 14, 2022, at 19:06, Thomas Munro  wrote:
> One difference would be the effect if ICU ever ships a minor library
> version update that changes the reported collversion.

If I’m reading it correctly, ICU would not change collation in major versions, 
as an explicit matter of policy around DUCET stability and versioning.

https://unicode.org/reports/tr10/#Stable_DUCET

> With some system of symlinks to make it all work with defaults for
> those who don't care, a libc could have
> /usr/share/locale/en...@cldr34.utf-8 etc so you could
> setlocale(LC_COLLATE, "en_US@CLDR34"), or something.  I suppose they
> don't want to promise to be able to interpret the old data in future
> releases, and, as you say, sometimes the changes are in C code, due to
> bugs or algorithm changes, not the data.

If I understand correctly, files in /usr/share/locale aren’t enough because 
those only have the tailoring rules, and core algorithm and data (before 
applying locale-specific tweaks) also change between versions. I’m pretty sure 
glibc works similar to UCA in this regard (albeit based on ISO 14651 and not 
CDLR), and the Unicode link above is a good illustration of default collation 
rules that underly the locale-specific tweaks.

-Jeremy

Sent from my TI-83

Re: Collation version tracking for macOS

2022-06-14 Thread Thomas Munro

On Wed, Jun 15, 2022 at 7:10 AM Jeremy Schneider
 wrote:
> > On Jun 14, 2022, at 14:10, Peter Eisentraut 
> >  wrote:
> > Conversely, why are we looking at the ICU version instead of the collation 
> > version.  If we have recorded the collation as being version 1234, we need 
> > to look through the available ICU versions (assuming we can load multiple 
> > ones somehow) and pick the one that provides 1234.  It doesn't matter 
> > whether it's the same ICU version that the collation was originally created 
> > with, as long as the collation version stays the same.

One difference would be the effect if ICU ever ships a minor library
version update that changes the reported collversion.

1.  With the code I proposed in my v4 patch, our version mismatch
warnings would kick in, but otherwise everything would continue to
work (and corrupt indexes, if they really moved anything around).
2.  With a system that (somehow) opens all available libraries and
looks for match, it would fail to find one.  That is assuming that you
are using the typical major-versioned packages we can see in software
distributions like Debian.

I don't know if minor version changes actually do that, though have
wondered out loud a few times in these threads.  I might go and poke
at some ancient packages to see if that's happened before.  To defend
against that, we could instead do major + minor versioning, but so far
I worried about major only because that's they way they ship 'em in
Debian and (AFAICS) RHEL etc, so if you can't easily install 68.0 and
68.1 at the same time.  On the other hand, you could always "pin" (or
similar concepts) the libicu68 package to a specific minor release, to
fix the problem (whether you failed like 1 or like 2 above).

> (Common mistake I’ve seen folks make when comparing OS glibc versions is only 
> looking at locale data, not realizing there have been changes to root 
> behavior that didn’t involve any changes to local data files)

Yeah, I've wondered idly before if libc projects and ICU couldn't just
offer a way to ask for versions explicitly, and ship historical data.
With some system of symlinks to make it all work with defaults for
those who don't care, a libc could have
/usr/share/locale/en...@cldr34.utf-8 etc so you could
setlocale(LC_COLLATE, "en_US@CLDR34"), or something.  I suppose they
don't want to promise to be able to interpret the old data in future
releases, and, as you say, sometimes the changes are in C code, due to
bugs or algorithm changes, not the data.

Re: Collation version tracking for macOS

2022-06-14 Thread Jeremy Schneider

> On Jun 14, 2022, at 14:10, Peter Eisentraut 
>  wrote:
> 
> Conversely, why are we looking at the ICU version instead of the collation 
> version.  If we have recorded the collation as being version 1234, we need to 
> look through the available ICU versions (assuming we can load multiple ones 
> somehow) and pick the one that provides 1234.  It doesn't matter whether it's 
> the same ICU version that the collation was originally created with, as long 
> as the collation version stays the same.

Does Unicode CDLR provide (or even track) versioning of collation or other i18n 
functionality for individual locale settings? I’m thinking it might not even 
have that concept in the original source repo/data, but I might be remembering 
wrong.

It would require not only watching for changes in the per-locale tailoring 
rules but also being cognizant of changes in root/DUCET behavior and 
understanding the impact of changes there.

(Common mistake I’ve seen folks make when comparing OS glibc versions is only 
looking at locale data, not realizing there have been changes to root behavior 
that didn’t involve any changes to local data files)

-Jeremy

Re: Collation version tracking for macOS

2022-06-14 Thread Peter Eisentraut


On 11.06.22 05:35, Peter Geoghegan wrote:

Do we even need to store a version for indexes most of the time if
we're versioning ICU itself, as part of the "time travelling
collations" design? For that matter, do we even need to version
collations directly anymore?


Conversely, why are we looking at the ICU version instead of the 
collation version.  If we have recorded the collation as being version 
1234, we need to look through the available ICU versions (assuming we 
can load multiple ones somehow) and pick the one that provides 1234.  It 
doesn't matter whether it's the same ICU version that the collation was 
originally created with, as long as the collation version stays the same.

Re: Collation version tracking for macOS

2022-06-13 Thread Peter Geoghegan

On Mon, Jun 13, 2022 at 5:41 PM Thomas Munro  wrote:
> It'd clearly be a terrible idea for us to try to use any of that, and
> Mac users should be very happy with the new support for ICU as DB
> default.

This suggests something that I already suspected: nobody particularly
expects the system lib C to be authoritative for the OS as a whole, in
the way that Postgres supposes. At least in the case of Mac OS, which
is after all purely a desktop operating system.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-13 Thread Thomas Munro

On Thu, Jun 9, 2022 at 11:33 AM Thomas Munro  wrote:
> On Thu, Jun 9, 2022 at 5:42 AM Tom Lane  wrote:
> > I'm sure that Apple are indeed updating the UTF8 data behind
> > their proprietary i18n APIs, but the libc APIs are mostly getting benign
> > neglect.
>
> As for how exactly they might be doing that, I don't know, but a bit
> of light googling tells me that a private, headerless,
> please-don't-call-me-directly copy of ICU arrived back in macOS
> 10.3[1].  I don't see it on my 12.4 system, but I also know that 12.x
> started hiding system libraries completely (the linker is magic and
> pulls libraries from some parallel dimension, there is no
> /usr/lib/libSystem.B.dylib file on disk, and yet otool -L
>  references it).

The other thread about a macOS linking problem nerd-sniped me back
into here to find out how to see breadcrumbs between hidden libraries
on this super weird UNIX™ and confirm that they are indeed still
shipping a private ICU for use by their Core Foundation stuff that's
used by fancy ObjC/Swift/... etc GUI apps.  The following command was
an interesting discovery for me because otool -L can't see any of the
new kind of ghost libraries:

% dyld_info -dependents
/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation

Though I can't get my hands on the hidden ICU library itself to
disassemble (without installing weird extra tools, apparently [1]),
that at least revealed its name, which I could then dlopen out of
curiosity.  It seems they jammed all the ICU sub-libraries into one,
and configured it with --disable-renaming so it doesn't have major
version suffixes on symbol names.

It'd clearly be a terrible idea for us to try to use any of that, and
Mac users should be very happy with the new support for ICU as DB
default.

[1] https://lapcatsoftware.com/articles/bigsur.html

Re: Collation version tracking for macOS

2022-06-12 Thread Thomas Munro

Hey Jeremy,

On Tue, Jun 7, 2022 at 12:42 PM Jeremy Schneider
 wrote:
> Thomas - thanks for the link back to one of the threads. I spent some time 
> reading through that and it’s a lot of material; I haven’t read the whole 
> thread yet. If you have some others that would also be particularly good 
> background, let me know. I’m doing a chunk of this in my spare time at the 
> moment, but I do want to keep getting more up to speed. I was pulled into a 
> bunch of various things related to PostgreSQL and ICU and collation and OS’s 
> over the past couple years, so I learned a lot from on-the-ground experience 
> and I am interested in trying to get a little more involved in the 
> conversation here.

There were more threads, but they mostly say the same things, hence my
current attempt to move from bloviation to trying out the ideas with
actual code :-D

> Personally, I really do think there should at least be an *option* to tell 
> the DB to fully error rather than just warn on version mismatch. Correctness 
> matters to many users, and being able to *trust* string comparisons are 
> correct is pretty damn fundamental all throughout a database. It really 
> doesn’t get any more basic and the potential for bad things to happen is 
> pretty astronomical, if you can’t trust those. I understand the consternation 
> about dealing with upgrades of large & busy databases, but I’m still 
> surprised that the community consensus arrived at the present behavior, and I 
> have a lot of reading to do, to really understand how that happened and where 
> the dialogue is today.

Given that the only thing you could do about it is REINDEX, and yet we
don't even know which indexes needed to be REINDEXed (the problem
Julien and I tried to address, but so far without success), it seemed
highly premature to convert the warning to an error.

I don't think the community consensus is that we have arrived
somewhere, it's more like we're in transit, possibly without a map.
For example 15 gains ICU support for the default collation (= how most
people consume collations), which changes things, and will surely lead
to more people thinking about this problem space.

> Multiple versions of ICU sounds nice for users who need real linguistic 
> collation (like what Oracle and DB2 offer), but I still feel like there needs 
> to be a super simple basic “pseudo-linguistic” collation baked in, that’s 
> “good enough” for 99% of users and that is guaranteed to be the same 
> everywhere on every platform and just won’t ever change. I think glibc needs 
> to be phased out somehow. At a minimum, not the default for new users… to 
> stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their 
> collations. I’d be reluctant to spend too much time on a POC now though, it 
> feels like my idea is the outlier and the general PG hacker consensus would 
> be to reject this idea. (But maybe I’m wrong?)

Hmm.  Well I personally wouldn't try to write a collator any more
willingly than I'd try to write a new cryptographic algorithm, just
not my bag.  We don't want to handle complaints about our sort order
(we already bat away complaints about glibc's, and I heard an account
from an OS vendor about the non-stop contradictory crank complaints
about sort order they used to receive before they gave up and adopted
ICU).  ICU really is quite authoritative here.

If you mean that you don't even want to have to specify a language
like "en", then note that you don't have to: ICU has a "root" collator
which you can request with an empty string (all other collators apply
cultural tweaks on top of that).  Like everything else, the root
collator has changed over time, though.

With my "distinct" experimental patch (v4), you could set your
database default collation to a specific ICU major version's root
collator like so:

create database ... locale_provider = icu icu_locale = '71:' template
= template0

That'll keep working, even across pg_upgrades to some version of
PostgreSQL far in the future that is linked against ICU 100, by
dlopen'ing the .71 libraries, for as long as you can get your hands on
a libicu71 package or otherwise build your own, and it'll error out if
it can't open that library, which may be the hard error you were
looking for.  If there's an API change in ICU we'll have to make some
changes, but that's already true.

Review/testing/flames/rants/better ideas welcome.

> Anyway, again, apologies for my pants-on-fire email last week. I hope I can 
> enjoy a few beers someday - or coffee for the non-drinkers - with a few other 
> PG collation nerds (which I never set out to be, but it may have befallen me 
> ).

+1

Re: Collation version tracking for macOS

2022-06-11 Thread Thomas Munro

On Sun, Jun 12, 2022 at 11:59 AM Thomas Munro  wrote:
>
> On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan  wrote:
> > What about "time travel collations", but without the time travel part?
> > That is, what about supporting multiple ICU versions per cluster, but
> > not per database? So you could upgrade the OS and Postgres, using
> > standard packages that typically just use the latest ICU version --
> > typically, but not always. If you happen to have been on an older
> > version of ICU on upgrade, then that version of ICU will still work at
> > the level of a whole database -- your database. Maybe you can create
> > new databases with old and new ICU versions if you want to.
> >
> > That obviously runs into the problem of needing to eventually do a
> > dump and reload -- but I suppose that "eventually" could be a very
> > long time. At least the OS package doesn't declare one version of ICU
> > the blessed version, now and forever, effectively vendoring ICU in a
> > backdoor fashion. At least old databases have significant runway,
> > while at the same time new databases that want to use the same
> > standard Postgres package aren't forced to use the same ancient ICU
> > version.
>
> Hmm.  I think that's effectively what you'd get using my "distinct
> collation" patch (v1, or this much better v3, attached), if you put
> version prefixes in colliculocale, and updated them in the template
> database after an OS upgrade to affect new databases.  I realise you
> probably mean something a little more automatic...

Thinking some more about what you said above: really, most people only
care about the default collation.  I'm not yet sure what I think
initdb should put into pg_collation when importing the initial set of
collation objects in the "distinct" world (perhaps an un-prefixed and
a prefixed variant of each, with names ending -x-icu and -x-icu63?),
but as for the default collation, I should point out that the
"distinct" patch already gives you a nailed-to-the-ground database
approximately as you described above if you just do something like
this:

postgres=# create database db2 locale_provider = icu icu_locale =
'67:en' template = template0 ...;

Small bugfix attached (v3 was accidentally calling uiter_setUTF8() and
u_errorName() directly in a couple of places).
From 265d6ea0d292255c164015814f4e810186d2ffb3 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v4] WIP: multi ICU, distinct collations

XXX This is highly experimental code
---
 src/backend/access/hash/hashfunc.c |  16 +-
 src/backend/utils/adt/formatting.c |  53 -
 src/backend/utils/adt/pg_locale.c  | 360 -
 src/backend/utils/adt/varchar.c|  16 +-
 src/backend/utils/adt/varlena.c|  56 ++---
 src/include/utils/pg_locale.h  |  72 ++
 src/tools/pgindent/typedefs.list   |   3 +
 7 files changed, 510 insertions(+), 66 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..0a61538efd 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,11 @@ hashtext(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-	uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+	 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-			uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+			 uchar, ulen, buf, bsize);
 
 			result = hash_any(buf, bsize);
 
@@ -355,11 +355,11 @@ hashtextextended(PG_FUNCTION_ARGS)
 
 			ulen = icu_to_uchar(, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
 
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-	uchar, ulen, NULL, 0);
+			bsize = PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+	 uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-			uchar, ulen, buf, bsize);
+			PG_ICU_LIB(mylocale)->getSortKey(PG_ICU_COL(mylocale),
+			 uchar, ulen, buf, bsize);
 
 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
 
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index e909c1a200..97c96cb7f6 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1602,6 +1602,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 	 const UChar *src, int32_t srcLength,
 	 const char *locale,
 	 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+		const UChar *src, int32_t srcLength,
+		UBreakIterator *bi,
+		const char *locale,
+		UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t

Re: Collation version tracking for macOS

2022-06-11 Thread Thomas Munro

On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan  wrote:
> What about "time travel collations", but without the time travel part?
> That is, what about supporting multiple ICU versions per cluster, but
> not per database? So you could upgrade the OS and Postgres, using
> standard packages that typically just use the latest ICU version --
> typically, but not always. If you happen to have been on an older
> version of ICU on upgrade, then that version of ICU will still work at
> the level of a whole database -- your database. Maybe you can create
> new databases with old and new ICU versions if you want to.
>
> That obviously runs into the problem of needing to eventually do a
> dump and reload -- but I suppose that "eventually" could be a very
> long time. At least the OS package doesn't declare one version of ICU
> the blessed version, now and forever, effectively vendoring ICU in a
> backdoor fashion. At least old databases have significant runway,
> while at the same time new databases that want to use the same
> standard Postgres package aren't forced to use the same ancient ICU
> version.

Hmm.  I think that's effectively what you'd get using my "distinct
collation" patch (v1, or this much better v3, attached), if you put
version prefixes in colliculocale, and updated them in the template
database after an OS upgrade to affect new databases.  I realise you
probably mean something a little more automatic...

I think "pinned forever" ICU versions would be useful, because I think
there are very few expert users who want fine generalised control over
version changes, but almost all other users don't care at all about
any of this stuff -- as long as their indexes keep indexing and their
constraints keep constraining.  So I think you can make a lot of
people happy by ignoring the complexities of upgrades and providing a
way to nail the version down for the lifetime of the database.  Also,
it's not *impossible* to move to a later ICU, it's just a bit tricky;
the key point is that it's under your control if you want to do that,
independently of an OS upgrade, as you said.

Based on my reading of that DB2 manual page, I reckon my v3 "distinct
collation" patch is about as good as what they have.  If you don't
choose to use prefixes then later OS upgrades (ie upgrades that change
the version of ICU that PostgreSQL is linked against) might corrupt
your indexes and constraints -- I think that's what they're saying --
though at least we'll try to warn about that with our weak warning
system.  If you do choose to use prefixes you'll be stuck on that ICU
version forever, even across updates that cause PostgreSQL to be
linked to future releases of ICU, unless you're prepared to do a whole
bunch of careful upgrading work (perhaps with some future tooling to
help with that).

Attached is a much more fleshed out version of the "distinct
collation" patch (to be clear: v3 descends from v1, while v2 was at
attempt at the timelord approach).  Main changes:

1.  I now also route strTo{Upper,Lower,Title} to the right version of
ICU.  That requires dlopen'ing a second library.

2.  You create distinct collations with optional ICU major version
prefixes, along the lines of what Jim was suggesting.  For example
(and I'm not actually proposing -x-icu67 suffixes, just avoiding a
collision in my example):

postgres=# create collation "en-x-icu67" (provider = icu , locale = '67:en');
CREATE COLLATION
postgres=# create collation "en-x-icu63" (provider = icu , locale = '63:en');
CREATE COLLATION

3.  For versions other than the one we are linked against, it tries to
open libraries with typical names inside $libdir.  An administrator
could drop symlinks in there like so:

$ ln -s /usr/lib/x86_64-linux-gnu/libicui18n.so.63 ~/install/lib/postgresql/
$ ln -s /usr/lib/x86_64-linux-gnu/libicuuc.so.63 ~/install/lib/postgresql/

What I like about this $libdir scheme is that I imagine that we could
ask our friends in the packaging teams to create packages for that.
Users would then think of them in much the same way as extensions.
You'd just type:

$ sudo apt-get install postgresql-16-icu71

Happy to keep trying to figure out the competing and rather more
ambitious TT version too (that I sketched some bits of in v2), but I'm
a whole lot fuzzier on how that can work and kinda stuck on the
problems I raised.
From bf748af1b2576c232bdec633fb8466092fa7ce92 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v3] WIP: multi ICU, distinct collations

XXX This is highly experimental code
---
 src/backend/access/hash/hashfunc.c |  16 +-
 src/backend/utils/adt/formatting.c |  49 +++-
 src/backend/utils/adt/pg_locale.c  | 355 -
 src/backend/utils/adt/varchar.c|  16 +-
 src/backend/utils/adt/varlena.c|  54 ++---
 src/include/utils/pg_locale.h  |  69 ++
 src/tools/pgindent/typedefs.list   |   3 +
 7 files changed, 498 insertions(+), 64 deletions(-)

diff --git

Re: Collation version tracking for macOS

2022-06-10 Thread Peter Geoghegan

On Fri, Jun 10, 2022 at 9:08 PM Thomas Munro  wrote:
> They're still useful for non-ICU collations (for example FreeBSD and
> Windows can tell you about version changes based on open standards),
> and they're *maybe* still useful for ICU, considering that there
> are minor version upgrades, though I hope that would never actually
> detect a change if we built a multi-version system like what we are
> discussing here.

Right. I was mostly just asking this as a rhetorical question.

What about "time travel collations", but without the time travel part?
That is, what about supporting multiple ICU versions per cluster, but
not per database? So you could upgrade the OS and Postgres, using
standard packages that typically just use the latest ICU version --
typically, but not always. If you happen to have been on an older
version of ICU on upgrade, then that version of ICU will still work at
the level of a whole database -- your database. Maybe you can create
new databases with old and new ICU versions if you want to.

That obviously runs into the problem of needing to eventually do a
dump and reload -- but I suppose that "eventually" could be a very
long time. At least the OS package doesn't declare one version of ICU
the blessed version, now and forever, effectively vendoring ICU in a
backdoor fashion. At least old databases have significant runway,
while at the same time new databases that want to use the same
standard Postgres package aren't forced to use the same ancient ICU
version.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-10 Thread Peter Geoghegan

On Fri, Jun 10, 2022 at 8:47 PM Thomas Munro  wrote:
> I'm also suspicious that there are more subtle hazards like pathkeys
> lurking in the shadows.  We go to great effort to recognise matching
> and non-matching collations by OID alone, which is why my first
> attempt was "distinct [OIDs]", so that'd keep working.

It's definitely possible that we won't be able to find a workable
solution that deals with "time travel collations" sensibly from the
planner's perspective. It's certainly not a neat adjunct to what we
have. I think that it *might* be possible to find a way to make it
work that is suboptimal, but works. Without being overly clever.

The DB2 docs say "use the CLDR version prefix to avoid unexpected
changes in behavior when upgrading to future releases of the Db2
database". But if you don't do that, and get a change in behavior,
then surely any related indexes must have been rebuilt too. The
interesting part may be what that upgrade looks like in detail.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-10 Thread Thomas Munro

On Sat, Jun 11, 2022 at 3:36 PM Peter Geoghegan  wrote:
> Do we even need to store a version for indexes most of the time if
> we're versioning ICU itself, as part of the "time travelling
> collations" design? For that matter, do we even need to version
> collations directly anymore?

They're still useful for non-ICU collations (for example FreeBSD and
Windows can tell you about version changes based on open standards),
and they're *maybe* still useful for ICU, considering that there
are minor version upgrades, though I hope that would never actually
detect a change if we built a multi-version system like what we are
discussing here.  Certainly they don't make sense in the current
catalog layout with TT collations, though, there's only one attribute
to cover N libraries (though the reverted version tracking thing would
handle it just fine, because that moved it into a per-index location).

I mention minor upgrade as a topic to poke at because the popular
Linux distros only allow major ICU versions to be installed
concurrently, but minor versions are also released from time to time
and replace the libraries (well, the .68 library is a symlink to
.68.1, and then changes to .68.2, following typical conventions, but
the packages don't let you have .68.1 and .68.2 at the same time).  To
pick a random example, ICU upgraded 68.1 -> 68.2 at one point, which a
bit of googling tells me included CLDR 38 -> CLDR 38.1.  It looks like
they tweaked a few super minor things.  Could such a change affect the
values that ucol_getVersion() reports?  This came up in the last round
of this stuff with Doole[1], but we didn't dig further and I still
don't know what to think about it.

[1] 
https://www.postgresql.org/message-id/CADE5jYJTnYaTNXMFKOK-0p44%2BDm5LMcRcJ5kVi1MVHomb2QTkQ%40mail.gmail.com

Re: Collation version tracking for macOS

2022-06-10 Thread Thomas Munro

On Sat, Jun 11, 2022 at 2:29 PM Peter Geoghegan  wrote:
> The special REINDEX (or whatever) won't work as an atomic
> operation...but that doesn't mean that the system as a whole will have
> a mix of old and new physical collations forever, or even for very
> long. So while everything still has to work correctly, mediocre
> performance with certain kinds of plan shapes might be okay.

Yeah.  And if you don't like the negative effects of a system in
transition, you could also create new otherwise identical indexes, and
then only drop the old ones once that's done, and add enough brains to
keep everything working nicely on the old indexes until enough of the
new indexes exist.  Or something.  I was thinking that could also be
true for the "distinct collations" concept, only with different
problems to solve...

> As you kind of said yourself yesterday, "time travelling collations"
> could naturally have an invariant that worked at the index/constraint
> level (not the collation level): any given index needs to use only one
> underlying ICU version at the same time, no matter what. The invariant
> could perhaps be leveraged in the planner -- perhaps we start out with
> a working assumption that *every* index is on the newer ICU version
> (when at least one index is on the lastest and "now current" version),
> and work backwards by excluding indexes that still have old physical
> collations. Defining the problem as a problem with old
> indexes/constraints only seems like it might make things a lot easier.

Yes, that (posited) invariant was an enabling realisation for the
(unfinished, but IMHO useful to grok) v2 patch.  The disabling
realisation that stopped me from finishing it was that I doubt my
ability to find all the right places to wrap with
"pg_icu_activate_major_version(X)" and
"pg_icu_activate_major_version(-1)", and thus the whole approach.  Do
you know where to put all the ICU version switching regions, and how
to verify that they cover exactly all the right code, but don't leak
into any of the wrong code, and do you know where to store/retrieve X?
 It feels... flimsy to me, but how else could you make collations
behave differently when being called on behalf of some index rather
than some other thing, given no other context?  Explicitly passing a
pg_icu_library all over the place also sounds non-fun.

I'm also suspicious that there are more subtle hazards like pathkeys
lurking in the shadows.  We go to great effort to recognise matching
and non-matching collations by OID alone, which is why my first
attempt was "distinct [OIDs]", so that'd keep working.

I wondered if DB2's support looked more like "time travel" or
"distinct".  Based only on a quick glance at their manual[1], it looks
a bit like they have "I don't care" collations which are subject to
weirdness on upgrade when they change underneath your feet, and then
"distinct" collations which have an explicit prefix to pin down the
version (indirectly via CLDR version) and route to a specific library
(N copies of ICU that ship with it), with a note recommending the
latter for indexes and constraints.  So I'd guess you'll stay on the
old versions forever until you explicitly migrate data to a new
collation.

[1] https://www.ibm.com/docs/en/db2/11.1?topic=support-locale-names-sql-xquery

Re: Collation version tracking for macOS

2022-06-10 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 9:31 PM Thomas Munro  wrote:
> Perhaps that could be modeled with a pg_depend row pointing to a
> pg_icu_library row, which you'd probably need anyway, to prevent a
> registered ICU library that is needed for a live index from being
> dropped.  (That's assuming that the pg_icu_library catalogue concept
> has legs...  well if we're going with dlopen(), we'll need *somewhere*
> to store the shared object paths.  Perhaps it's not a given that we
> really want paths in a table... I guess it might prevent certain
> cross-OS streaming rep scenarios, but mostly that'd be solvable with
> symlinks...)

Do we even need to store a version for indexes most of the time if
we're versioning ICU itself, as part of the "time travelling
collations" design? For that matter, do we even need to version
collations directly anymore?

I'm pretty sure that the value of pg_collation.collversion is always
the same in practice, or has a lot of redundancy. Because mostly it's
just an ICU version. This is what I see on my system, at least:

pg@regression:5432 [53302]=# select count(*), collversion from
pg_collation where collprovider = 'icu' group by 2;
 count │ collversion
───┼─
   329 │ 153.112.41
   471 │ 153.112
(2 rows)

(Not sure why there are two different distinct collversion values
offhand, but generally looks like collversion isn't terribly
meaningful at the level of individual pg_collation entries.)

If indexes and constraints with old physical collations are defined as
being the exception to the general rule (the rule meaning "every index
uses the current ICU version for the database as a whole"), and if
those indexes/constraints are enumerated and stored (in a new system
catalog) when a switchover of the database's ICU version is first
initialized, then there might not be any meaningful dependency to
speak of. Not for indexes, at least.

The *database as a whole* is dependent on the current version of ICU
-- it's not any one index. Very occasionally the database will also be
dependent on a single older ICU version that we're still transitioning
away from. There is a "switch-a-roo" going on, but not really at the
level of indexes -- it's a very specialized thing, that works at the
level of the whole database, and involves exactly 2 ICU versions. You
should probably be able to back out of it once it begins, but mostly
it's an inflexible process that just does what we need it to do.

Does something like that seem sensible to you?

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-10 Thread Peter Geoghegan

On Fri, Jun 10, 2022 at 6:48 PM Thomas Munro  wrote:
> Executive summary of experiments so far: the "distinct collations"
> concept is quite simple and robust, but exposes all the versions to
> users and probably makes it really hard to upgrade (details not worked
> out), while the "time travelling collations" concept is nice for users
> but hard to pin down and prove correctness for since it seems to
> require dynamic scoping/global state changes affecting code in far
> away places.

It didn't really occur to me until now that the pg_dump problems that
come with the approach you outlined ("distinct collations") are likely
to be total blockers, and not just disadvantages. It's not just ICU
that prefers approximately correct behavior over throwing an "unknown
collation" error -- the same could be said for pg_dump itself. After
all, pg_dump doesn't care about collation versions -- except when run
in binary mode, for pg_upgrade, where it must satisfy the requirements
of pg_upgrade.

Even today we could be restoring to a server with an older ICU
version, where in general we might not get exactly the behavior the
user expects (though usually only when they've decided to use advanced
features like custom tailorings). So pg_dump already deliberately
disregards the ICU version, for essentially the same reasons that I
argued were good reasons upthread (when talking about a
multi-ICU-version Postgres via "time travelling collations").

Some more thoughts on "time travelling collations":

Doing a version switch in one atomic operation (like a special
REINDEX) isn't going to be practical. We need to be prepared for cases
where a database has a mix of indexes with old and new physical
collations. We certainly cannot allow queries to give wrong
answers...but I tend to doubt that (say) making merge joins work with
two indexes with different physical collations (though matching
logical collations) actually makes much sense. Maybe we can cut scope
in a pragmatic way instead.

The special REINDEX (or whatever) won't work as an atomic
operation...but that doesn't mean that the system as a whole will have
a mix of old and new physical collations forever, or even for very
long. So while everything still has to work correctly, mediocre
performance with certain kinds of plan shapes might be okay.

As you kind of said yourself yesterday, "time travelling collations"
could naturally have an invariant that worked at the index/constraint
level (not the collation level): any given index needs to use only one
underlying ICU version at the same time, no matter what. The invariant
could perhaps be leveraged in the planner -- perhaps we start out with
a working assumption that *every* index is on the newer ICU version
(when at least one index is on the lastest and "now current" version),
and work backwards by excluding indexes that still have old physical
collations. Defining the problem as a problem with old
indexes/constraints only seems like it might make things a lot easier.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-10 Thread Thomas Munro

On Fri, Jun 10, 2022 at 4:30 PM Thomas Munro  wrote:
> I'm not sold on any particular plan, but working through some examples
> helped me see your idea better...  I may try to code that up in a
> minimal way so we can kick the tyres...

I did a bit of hacking on that idea.  The goal was to stamp each index
with an ICU major version (not sure where, not done in the attached),
and if that doesn't match the library we're linked against, we'd try
to dlopen() libraries via symlinks with known name formats under
PGDATA/pg_icu_lib, which an administrator would have to create.  That
seemed a bit simpler than dealing with new catalogs for now...

See attached unfinished patch, which implements some of that.  It has
a single collation for en-US-x-icu, and routes calls to different
libraries depending on dynamic scope (which in cold hard reality
translates into a nasty global variable "current_icu_library").  The
idea was that it would normally point to the library we're linked
against, but whenever computing anything related to an index stamped
with ICU 63, we'd do pg_icu_activate_major_version(63), and afterwards
undo that.  Performance concerns aside, that now seems a bit too ugly
and fragile to me, and I gave up.  How could we convince ourselves
that we'd set the active ICU library correctly in all the required
dynamic scopes, but not leaked it into any other scopes?  Does that
even make sense?  But if not done like that, how else could we do it?

Better ideas/code welcome.

Executive summary of experiments so far: the "distinct collations"
concept is quite simple and robust, but exposes all the versions to
users and probably makes it really hard to upgrade (details not worked
out), while the "time travelling collations" concept is nice for users
but hard to pin down and prove correctness for since it seems to
require dynamic scoping/global state changes affecting code in far
away places.
From e842402fd6eeca413915c3808191d7928dfc0889 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 8 Jun 2022 17:43:53 +1200
Subject: [PATCH v2] WIP: allow multiple ICU libraries

XXX This is highly experimental code
---
 src/backend/access/hash/hashfunc.c |  18 +-
 src/backend/utils/adt/pg_locale.c  | 280 +++--
 src/backend/utils/adt/varchar.c|  16 +-
 src/backend/utils/adt/varlena.c|  57 +++---
 src/backend/utils/init/miscinit.c  |   9 +
 src/bin/initdb/initdb.c|   1 +
 src/include/utils/pg_locale.h  |  94 +-
 7 files changed, 420 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b57ed946c4..81571658ed 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -298,11 +298,13 @@ hashtext(PG_FUNCTION_ARGS)

 			ulen = icu_to_uchar(, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));

-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-	uchar, ulen, NULL, 0);
+			bsize = current_icu_library->getSortKey(pg_icu_collator(current_icu_library,
+	mylocale),
+	uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-			uchar, ulen, buf, bsize);
+			current_icu_library->getSortKey(pg_icu_collator(current_icu_library,
+			mylocale),
+			uchar, ulen, buf, bsize);

 			result = hash_any(buf, bsize);

@@ -352,14 +354,14 @@ hashtextextended(PG_FUNCTION_ARGS)
 			UChar	   *uchar = NULL;
 			Size		bsize;
 			uint8_t*buf;
+			UCollator  *ucol;

 			ulen = icu_to_uchar(, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));

-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-	uchar, ulen, NULL, 0);
+			ucol = pg_icu_collator(current_icu_library, mylocale);
+			bsize = current_icu_library->getSortKey(ucol, uchar, ulen, NULL, 0);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-			uchar, ulen, buf, bsize);
+			current_icu_library->getSortKey(ucol, uchar, ulen, buf, bsize);

 			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index a0490a7522..d9d17f1b8c 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -58,6 +58,7 @@
 #include "catalog/pg_collation.h"
 #include "catalog/pg_control.h"
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/hsearch.h"
@@ -78,6 +79,10 @@
 #include 
 #endif

+#ifdef HAVE_DLOPEN
+#include 
+#endif
+
 #define		MAX_L10N_DATA		80

@@ -1435,29 +1440,265 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }

+#ifdef USE_ICU
+
 struct pg_locale_struct default_locale;

+/* Table of ICU libraries we have loaded. */
+static pg_icu_library *icu_libraries[PG_NUM_ICU_MAJOR_VERSIONS];
+
+/*
+ * Currently active ICU library.  Normally this corresponds to the ICU library
+ * that we were

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro

On Fri, Jun 10, 2022 at 12:48 PM Tobias Bussmann  wrote:
> Perhaps I can shed some light on this matter:

Hi Tobias,

Oh, thanks for your answers.  Definitely a few bits of interesting
archeology I was not aware of.

> Apple's libc collations have always been a bit special in that concern, even 
> for the non-UTF8 ones. Rooted in ancient FreeBSD they "try to keep collating 
> table backward compatible with ASCII" thus upper and lower cases characters 
> are separated (There are exceptions like 'cs_CZ.ISO8859-2').

Wow.  I see that I can sort the English dictionary the way most people
expect by pretending it's Czech.  What a mess!

> With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't 
> see a difference between "C" and "en_US.UTF-8" but with "( echo '5£'; echo 
> '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( 
> echo '5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q 
> <(LC_COLLATE=C sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8 sort 
> /usr/share/dict/words)"

I see, so it does *something*, just not what anybody wants.

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro

On Fri, Jun 10, 2022 at 1:48 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro  wrote:
> > Well I can report that the system from ec483147 was hellishly
> > complicated, and not universally loved.  Which isn't to say that there
> > isn't a simple and loveable way to do it, waiting to be discovered,
> > and I do think we could fix most of the problems with that work.
>
> I admit that I don't have much idea of how difficult it would be to
> make it all work. I'm definitely not claiming that it's easy.

Hrrm... perhaps my memory of ec483147 is confusing me.  I think I'm
starting to come around to your idea a bit more now.  Let me sketch
out some more details here and see where this goes.

I *was* thinking that you'd have to find all references to collations
through static analysis, as we did in that version tracking project.
But perhaps for this you only need to record one ICU library version
for the whole index at build time, without any analysis at all, and it
would be used for any and all ICU collations that are reached while
evaluating anything to do with that index (index navigation, but also
eg WHERE clause for partial index, etc).  That would change to the
"current" value when you REINDEX.

Perhaps that could be modeled with a pg_depend row pointing to a
pg_icu_library row, which you'd probably need anyway, to prevent a
registered ICU library that is needed for a live index from being
dropped.  (That's assuming that the pg_icu_library catalogue concept
has legs...  well if we're going with dlopen(), we'll need *somewhere*
to store the shared object paths.  Perhaps it's not a given that we
really want paths in a table... I guess it might prevent certain
cross-OS streaming rep scenarios, but mostly that'd be solvable with
symlinks...)

One problem is that to drop an old pg_icu_library row, you'd have to
go and REINDEX everything, even indexes that don't really use
collations!  If you want to prove that an index doesn't use
collations, you're back in ec483147 territory.  Perhaps we don't care
about that and we're happy to let useless dependencies on
pg_icu_library rows accumulate, or to require useless work to be able
to drop them.

I'm not sure how we'd know what the "current" library version is.  The
highest numbered one currently in that pg_icu_library catalogue I
sketched?  So if I do whatever new DDL we invent to tell the system
about a new ICU library, and it's got a higher number than any others,
new indexes start using it but old ones keep using whatever they're
using.  Maybe with some way for users to override it, so users who
really want to use an older one when creating a new index can say so.

I suppose it would be the same for constraints.  For those,
considering that they need to be rechecked, the only way to change ICU
version would be to drop the constraint and recreate it.  Same goes
for range partitioned tables, right?  It'd keep using the old ICU
library until you drop the p table and create a new one, at which
point you're using the new current ICU library and it'll recheck all
your partitions against the constraints when you add them.  (Those
constraints are much simpler constants, so for those we could prove no
use of ICU without the general ec483147 beast.)

I think these things would have to survive pg_upgrade, but would be
lost on dump/restore.

There's still the pathkey problem to solve, and maybe some more
problems like that hiding somewhere.

I'm not sold on any particular plan, but working through some examples
helped me see your idea better...  I may try to code that up in a
minimal way so we can kick the tyres...

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro  wrote:
> Well I can report that the system from ec483147 was hellishly
> complicated, and not universally loved.  Which isn't to say that there
> isn't a simple and loveable way to do it, waiting to be discovered,
> and I do think we could fix most of the problems with that work.

I admit that I don't have much idea of how difficult it would be to
make it all work. I'm definitely not claiming that it's easy.

> I understand that that's not ideal from an
> end-user perspective, but maybe it's more realistically and robustly
> and simply implementable.  Hmm.

That may be a decisive reason to go with your proposal. I really don't know.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-09 Thread Tobias Bussmann

Am 08.06.2022 um 16:16 schrieb Tom Lane :
> The proposed patch would result in a warning about every collation-
> sensitive index during every macOS major version upgrade, ie about
> once a year for most people.  
> We need something that has at least *some* connection to actual changes.

In Postgres.app we introduced default collation versioning and warnings about 
possible mismatches from outside the actual server. When the user runs initdb 
with the GUI wrapper, the OS version and a checksum of the LC_COLLATE file of 
the used default collation is stored as meta-data. This allows to display a 
reindex warning on startup if the hash changes or we hardcode a known 
incompatible OS change.

Having collversion support on macOS within postgres would leverage the existing 
infrastructure for version change warnings and enables support for multiple 
collations. But I agree, we need something more specific than the major OS 
version here. Lacking any collation version information from the provider, a 
checksum on the binary LC_COLLATE file is the best I can come up with. 

Best regards,
Tobias

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro

On Fri, Jun 10, 2022 at 1:06 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro  wrote:
> > That sounds nice, but introduces subtle problems for the planner.  For
> > example, pathkeys that look compatible might not be, when
> > merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
> > could teach it about that, whereas with my distinct OID concept they
> > would already be considered non-matching automatically.
>
> Right -- my proposal is likely to be more difficult to implement.
> Seems like it might be worth going to the trouble of teaching the
> planner about this difference, though.

Well I can report that the system from ec483147 was hellishly
complicated, and not universally loved.  Which isn't to say that there
isn't a simple and loveable way to do it, waiting to be discovered,
and I do think we could fix most of the problems with that work.  It's
just that I was rather thinking of this new line of attack as being a
way to avoid the complications of identifying dependencies on moving
things through complicated analysis of object graphs and AST, by
instead attaching those slippery external things to the floor with a
nail gun.  That is, treating ICU 63 and ICU 67's collations as
completely unrelated.  I understand that that's not ideal from an
end-user perspective, but maybe it's more realistically and robustly
and simply implementable.  Hmm.

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro  wrote:
> That sounds nice, but introduces subtle problems for the planner.  For
> example, pathkeys that look compatible might not be, when
> merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
> could teach it about that, whereas with my distinct OID concept they
> would already be considered non-matching automatically.

Right -- my proposal is likely to be more difficult to implement.
Seems like it might be worth going to the trouble of teaching the
planner about this difference, though.

That exact issue seems like the true underlying problem to me: we have
two sets of behaviors for a given collation, that are equivalent for
some purposes (the user thinks of them as totally interchangeable),
but not for other purposes (we can't expect old indexes to continue to
work with a new physical collation for their logical collation). So
directly tackling that seems natural to me.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro

On Fri, Jun 10, 2022 at 12:32 PM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro  wrote:
> > You seem to have some
> > other idea in mind where the system only knows about one
> > "en-US-x-icu", but somehow, somewhere else (where?), keeps track of
> > which indexes were built with ICU 63 and which with ICU 67, which I
> > don't yet grok.  Or did I misunderstand?
>
> That's what I meant, yes -- you got it right.

OK, I see now.

I think if you design a system to record the library that each index
(and constraint, ...) was built with, it'd surely finish up being at
least conceptually something like the system Julien and I built and
then reverted in ec483147.  Except that it'd be a stronger form of
that, because instead of just squawking when the version is not the
latest/current version, it'd keep working but route collations to the
older library for indexes that haven't been rebuilt yet.

That sounds nice, but introduces subtle problems for the planner.  For
example, pathkeys that look compatible might not be, when
merge-joining an ICU 63 index scan against an ICU 67 index scan.  You
could teach it about that, whereas with my distinct OID concept they
would already be considered non-matching automatically.

Re: Collation version tracking for macOS

2022-06-09 Thread Tobias Bussmann

Thanks for picking this up!

> How can I see evidence of this?  I'm comparing Debian, FreeBSD and
> macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort
> /usr/share/dict/words" I get upper and lower case mixed together on
> the other OSes, but on the Mac the upper case comes first, which is my
> usual smoke test for "am I looking at binary sort order?"

Perhaps I can shed some light on this matter:

Apple's libc collations have always been a bit special in that concern, even 
for the non-UTF8 ones. Rooted in ancient FreeBSD they "try to keep collating 
table backward compatible with ASCII" thus upper and lower cases characters are 
separated (There are exceptions like 'cs_CZ.ISO8859-2'). The latest public 
sources I can find are in adv_cmds-119 [1] which belongs to OSX 10.5 [2] - 
these correspond to the ones used in FreeBSD till v10 [3], whereby the 
timestamps rather point its origin around FreeBSD 5. Further, there are only 
very few locales actually present on macOS (36 - none of it supporting Unicode) 
and these have not changed for a very long time (I verified that from OS X 
10.6.8 till macOS 12.4 [4], exception is a 'de_DE-A.ISO8859-1' present only in 
macOS 10.15).

What they do instead is symlinking [5] missing collations to similar ones even 
across encodings, often resulting in la_LN.US-ASCII ('la_LN' seem to stand for 
a Latin meta language) being used which is exactly byte order [6]. These 
symlinks have not changed [7] from OS X 10.6.8 till macOS 10.15.7. But in macOS 
11 many of these symlinks changed their target. So did the popular 
'en_US.UTF-8' from 'la_LN.US-ASCII' to 'la_LN.ISO8859-1' or 'de_DE.UTF-8' from 
'la_LN.US-ASCII' to 'de_DE.ISO8859-1'. In effect, about half of the UTF-8 
collations change from no collation to partial/broken collation support. macOS 
12 again shows no changes - tests for macOS 13 are outstanding.

# tl:dr;

With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't 
see a difference between "C" and "en_US.UTF-8" but with "( echo '5£'; echo '£5' 
) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( echo 
'5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C 
sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8 sort 
/usr/share/dict/words)"

The upside is that we don't have to cope with the new characters added in every 
version of Unicode (although I have not examined LC_CTYPE yet).

best regards
Tobias

[1]: 
https://github.com/apple-oss-distributions/adv_cmds/tree/adv_cmds-119/usr-share-locale.tproj/colldef
[2]: https://opensource.apple.com/releases/
[3]: https://github.com/freebsd/freebsd-src/tree/stable/10/share/colldef
[4]: find /usr/share/locale/*/LC_COLLATE -type f -exec md5 {} \;
[5]: 
https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/BSDmakefile
[6]: 
https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/la_LN.US-ASCII.src
[7]: find /usr/share/locale/*/LC_COLLATE -type l -exec stat -f "%N%SY" {} \;

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro  wrote:
> However, since you mentioned that a simple REINDEX would get you from
> one library version to another, I think we're making some completely
> different assumptions somewhere along the line, and I don't get your
> idea yet.  It sounds like you don't want two different collation OIDs
> in that case?

Not completely sure about the REINDEX behavior, but it's at least an
example of the kind of thing that could be enabled. I'm proposing that
pg_collation-wise collations have the most abstract possible
definitions -- "logical collations", which are decoupled from
"physical collations" that actually describe a particular ICU collator
associated with a particular ICU version (all the information that
keeps how the on-disk structure is organized for a given relfilenode
straight). In other words, the definition of a collation is the user's
own definition. To the user, it's pretty close to (maybe even exactly)
a BCP47 string, now and forever.

You can make arguments against the REINDEX behavior. And maybe those
arguments will turn out to be good arguments. Assuming that they are,
then the solution may just be to have a special option that will make
the REINDEX use the most recent library.

The important point is to make the abstraction as high level as
possible from the point of view of users.

> You seem to have some
> other idea in mind where the system only knows about one
> "en-US-x-icu", but somehow, somewhere else (where?), keeps track of
> which indexes were built with ICU 63 and which with ICU 67, which I
> don't yet grok.  Or did I misunderstand?

That's what I meant, yes -- you got it right.

Another way to put it would be to go as far as we can in the direction
of decoupling the concerns that we have as database people from the
concerns of natural language experts. Let's not step on their toes,
and let's avoid having our toes trampled on.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro

On Fri, Jun 10, 2022 at 10:29 AM Peter Geoghegan  wrote:
> On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim  wrote:
> > For example, an alternate syntax might be:
> >
> > create collation icu63."en-US-x-icu" (provider = icu, locale = 
> > 'en-US@colVersion=63');
>
> Why would a user want to specify an ICU version in DDL? Wouldn't that
> break in the event of a dump and reload of the database, for example?
> It also strikes me as being inconsistent with the general philosophy
> for ICU and the broader BCP45 IETF standard, which is "interpret the
> locale string to the best of our ability, never throw an error".
>
> Your proposed syntax already "works" today! You just need to create a
> schema called icu63 -- then the command executes successfully (for
> certain values of successfully).

Jim was proposing the @colVersion=63 part, but the schema part came
from my example upthread.  That was from a real transcript, and I
included that  because the way I've been thinking of this so far has
distinct collation OIDs for the "same" collation from different ICU
libraries, and yet I want them to have the same collname.  That is, I
don't want (say) "en-US-x-icu63" and "en-US-x-icu71"... I thought it'd
be nice to keep using "en-US-x-icu" as we do today, so if there are
two of them they'd *have* to be in different schemas.  That has the
nice property that you can use the search_path to avoid mentioning it.
But I'm not at all wedded to that idea, or any other ideas in this
thread, just trying stuff out...

However, since you mentioned that a simple REINDEX would get you from
one library version to another, I think we're making some completely
different assumptions somewhere along the line, and I don't get your
idea yet.  It sounds like you don't want two different collation OIDs
in that case?

The (vastly too) simplistic way I was thinking of it, if you have a
column with an ICU 63 collation, to switch to ICU 67 you first do some
DDL to add ICU 67 to your system and import 67's collations (creating
new collation OIDs), and then eg ALTER TABLE foo ALTER COLUMN bar TYPE
text COLLATE icu67."en-US-x-icu", which will rebuild your indexes.
That's a big job, and doesn't address how you switch the database
default collation.  None of that is very satisfying, much more thought
needed, but it falls out of the decision to have distinct
icu63."en-US-x-icu" and icu67."en-US-x-icu".  You seem to have some
other idea in mind where the system only knows about one
"en-US-x-icu", but somehow, somewhere else (where?), keeps track of
which indexes were built with ICU 63 and which with ICU 67, which I
don't yet grok.  Or did I misunderstand?

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 4:23 PM Thomas Munro  wrote:
> Suppose you pg_upgrade to something that is linked against 71.
> Perhaps you'd need to tell it how to dlopen 67 before you can open any
> collations with that library, but once you've done that your
> collation-dependent partition constraints etc should all hold.  I
> dunno, lots of problems to figure out here, including quite broad ones
> about various migration problems.  I haven't understood what Peter G
> is suggesting about how upgrades might work, so I'll go and try to do
> that...

I'm mostly just arguing for the idea that we should treat ICU versions
as essentially interchangeable in terms of their high-level
capabilities around collations and languages/scripts/whatever provided
for by the underlying CLDR version -- tools like pg_dump shouldn't
need to care about ICU versions per se. *ICU itself* should be
versioned, rather than having multiple independent ICU collation
providers. This should work as well as anything like this can ever be
expected to work -- because internationalization is just hard.

These remarks need to be interpreted in the context of how
internationalization is *supposed* to work under standards like BCP47
(again, this is a broad RFC about internationalization, not really an
ICU thing). Natural languages are inherently squishy, messy things.
The "default ICU collations" that initdb puts in pg_collation are not
really special to ICU -- we generate them through a quasi-arbitrary
process that iterates through top-level locales, which results in a
list that is a bit like what you get with libc collations. If you
pg_upgrade, you might have leftover "default ICU collations" that
wouldn't have been the default on a new initdb. It's inherently pretty
chaotic (because humans aren't as predictable as computers), which is
why BCP47 itself is so forgiving -- it literally has to be. Plus there
really isn't much downside to being so lax; as Jeremy pretty much said
already, the important thing is generally to have roughly the right
idea -- which this fuzzy approach mostly manages to do.

Let's not fight that. Let's leave the natural language stuff to the
experts, by versioning a single collation provider (like ICU), and
generalizing the definition of a collation along the same lines --
something that can be implemented using any available version of ICU
(with a preference for the latest on REINDEX, perhaps). It might turn
out that an older version does a slightly better job than a newer
version (regressions cannot be ruled out), but ultimately that's not
our problem. It can't be -- we're not the unicode consortium.

It's theoretically up to the user to make sure they're happy with any
behavioral changes under this scheme, perhaps by testing. They won't
actually test very often, of course, but that shouldn't matter in
practice. This is already what we advise for users that use advanced
tailorings of custom ICU collations, such as a custom collation for
"natural sorting", often used for things like alphanumeric invoice
numbers. That might break if you downgrade ICU version, and maybe even
if you upgrade ICU version.

--
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-09 Thread Thomas Munro

On Fri, Jun 10, 2022 at 9:20 AM Finnerty, Jim  wrote:
> Specifying the library name before the language-country code with a new 
> separator  (":") as you suggested below has some benefits.

One of the reasons for putting some representation of desired library
into the colliculocale column (rather than, say, adding a new column
pg_collation) is that I think we'd also want to be able to put that
into daticulocale (for the database default collation, when using
ICU).  But really I just did that because it was easy... perhaps, both
pg_collation and pg_database could gain a new column, and that would
be a little more pleasing from a schema design point of view (1NF
atomicity, and it's a sort of foreign key, or at least it would be if
there were another catalog to list library versions...)?

> Did you consider making the collation version just another collation 
> attribute, such as colStrength, colCaseLevel, etc.?
> For example, an alternate syntax might be:
>
> create collation icu63."en-US-x-icu" (provider = icu, locale = 
> 'en-US@colVersion=63');

Hmm, I hadn't considered that.  (I wouldn't call it "col" version BTW,
it's a library version, and we don't want to overload our terminology
for collation version.  We'd still be on the look out for collversion
changes coming from a single library's minor version changing, for
example an apt-get upgrade can replace the .63 files, which on most
systems are symlinks to .63.1, .63.2 etc. ☠️)

> Was the concern that ICU might redefine a new collation property with the 
> same name in a different and incompatible way (we might work with the ICU 
> developers to agree on what it should be), or that a version is just not the 
> same kind of collation property as the other collation properties?

Well my first impression is that we don't really own that namespace,
and since we're using this to decide which library to route calls to,
it seems nicer to put it at a "higher level" than those properties.
So I'd prefer something like "63:en-US", or 63 in a new column.

> (in the example above, I'm assuming that for provider = icu, we could 
> translate '63' into  'libicui18n.so.63' automatically.)

Yeah.  My patch that jams a library name in there was just the fastest
way I could think of to get something off the ground to test whether I
could route calls to different libraries (yes!), though at one moment
I thought it wasn't terrible.  But aside from any aesthetic complaints
about that way of doing it, it turns out not to be enough: we need to
dlopen() two different libraries, because we also need some ctype-ish
functions from this guy:

$ nm -D -C /usr/lib/x86_64-linux-gnu/libicuuc.so.63.1 | grep u_strToUpper
000d22c0 T u_strToUpper_63

I guess we probably want to just put "63" somewhere in pg_collation,
as you say.  But then, teaching PostgreSQL how to expand that to a
name that is platform/packaging dependent seems bad.  The variations
would probably be minor; on a Mac it's .dylib, on AIX it may be .a,
and the .63 convention may not be universal, I dunno, but some systems
might need absolute paths (depending on ld.so.conf etc), but that's
all stuff that I think an administrator should care about, not us.

Perhaps there could be a new catalog table just for that.  So far I
have imagined there would still be one special ICU library linked at
build time, which doesn't need to be dlopen'd, and works automatically
without administrators having to declare it.  So a system that has one
linked-in library version 67, and then has two extras that have been
added by an administrator running some new DDL commands might have:

postgres=# select * from pg_icu_library order by version;
 version |libicuuc|libicui18n
-++--
  58 | libicuuc.so.58 | libicui18n.so.58
  63 | libicuuc.so.63 | libicui18n.so.63
  67 ||
(3 rows)

Suppose you pg_upgrade to something that is linked against 71.
Perhaps you'd need to tell it how to dlopen 67 before you can open any
collations with that library, but once you've done that your
collation-dependent partition constraints etc should all hold.  I
dunno, lots of problems to figure out here, including quite broad ones
about various migration problems.  I haven't understood what Peter G
is suggesting about how upgrades might work, so I'll go and try to do
that...

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim  wrote:
> Specifying the library name before the language-country code with a new 
> separator  (":") as you suggested below has some benefits. Did you consider 
> making the collation version just another collation attribute, such as 
> colStrength, colCaseLevel, etc.?
> For example, an alternate syntax might be:
>
> create collation icu63."en-US-x-icu" (provider = icu, locale = 
> 'en-US@colVersion=63');

Why would a user want to specify an ICU version in DDL? Wouldn't that
break in the event of a dump and reload of the database, for example?
It also strikes me as being inconsistent with the general philosophy
for ICU and the broader BCP45 IETF standard, which is "interpret the
locale string to the best of our ability, never throw an error".

Your proposed syntax already "works" today! You just need to create a
schema called icu63 -- then the command executes successfully (for
certain values of successfully).

I'm not arguing against the need for something like this. I'm just
pointing out that there are good reasons to imagine that it would
largely be an implementation detail, perhaps only used to
unambiguously identify which specific ICU version and locale string
relate to which on-disk relfilenode structure currently.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-09 Thread Finnerty, Jim

Specifying the library name before the language-country code with a new 
separator  (":") as you suggested below has some benefits. Did you consider 
making the collation version just another collation attribute, such as 
colStrength, colCaseLevel, etc.?  
For example, an alternate syntax might be:  

create collation icu63."en-US-x-icu" (provider = icu, locale = 
'en-US@colVersion=63');

Was the concern that ICU might redefine a new collation property with the same 
name in a different and incompatible way (we might work with the ICU developers 
to agree on what it should be), or that a version is just not the same kind of 
collation property as the other collation properties?

(in the example above, I'm assuming that for provider = icu, we could translate 
'63' into  'libicui18n.so.63' automatically.)


On 6/8/22, 6:22 AM, "Thomas Munro"  wrote:


postgres=# create collation icu63."en-US-x-icu" (provider = icu,
locale = 'libicui18n.so.63:en-US');
CREATE COLLATION

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
 wrote:
> MySQL did the right thing here by doing what every other RDBMS did, and just 
> making a simple “good-enough” collation hardcoded in the DB, same across all 
> platforms, that never changes.

That's not true. Both SQL Server and DB2 have some notion of
collations that are versioned.

Oracle may not, but then Oracle also handles collations by indexing
strxfrm() blobs, with all of the obvious downsides that that entails
(far larger indexes, issues with index-only scans). That seems like an
excellent example of what not to do.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider
 wrote:
> I’m probably just going to end up rehashing the old threads I haven’t read 
> yet…
>
> One challenge with this approach is you have things like sort-merge joins 
> that require the same collation across multiple objects. So I think you’d 
> need to keep all the old indexes around until you have new indexes available 
> for all objects in a database, and somehow the planner would need to be smart 
> enough to dynamically figure out old vs new versions on a query-by-query 
> basis.

I don't think that it would be fundamentally difficult to have the
planner deal with collations at the level required to avoid incorrect
query plans.

I'm not suggesting that this is an easy project, or that the end
result would be totally free of caveats, such as the issue with merge
joins. I am only suggesting that something like this seems doable.
There aren't that many distinct high level approaches that could
possibly decouple upgrading Postgres/the OS from reindexing. This is
one.

> And my opinion is that the problems caused by depending on OS libraries for 
> collation need to be addressed on a shorter timeline than what’s realistic 
> for inventing a new way for a relational database to offer transparent or 
> online upgrades of linguistic collation versions.

But what does that really mean? You can use ICU collations as the
default for the entire cluster now. Where do we still fall short? Do
you mean that there is still a question of actively encouraging using
ICU collations?

I don't understand what you're arguing for. Literally everybody agrees
that the current status quo is not good. That much seems settled to
me.

> Also I still think folks are overcomplicating this by focusing on linguistic 
> collation as the solution.

I don't think that's true; I think that everybody understands that
being on the latest linguistic collation is only very rarely a
compelling feature. The whole way that BCP47 tags are so forgiving is
entirely consistent with that view of things.

But what difference does it make? As long as you accept that any
collation *might* need to be updated, or the default ICU version might
change on OS upgrade, then you have to have some strategy for dealing
with the transition. Not being on a very old obsolete version of ICU
will eventually become a "compelling feature" in its own right.

I believe that EDB adopted ICU many years ago, and stuck with one
vendored version for quite a few years. And eventually being on a very
old version of ICU became a real problem.

-- 
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-09 Thread Jeremy Schneider

> On Jun 8, 2022, at 22:40, Peter Geoghegan  wrote:
> 
> On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider
>  wrote:
>> Even if PG supports two versions of ICU, how does someone actually go about 
>> removing every dependency on the old version and replacing it with the new?
> 
> They simply REINDEX, without changing anything. The details are still
> fuzzy, but at least that's what I was thinking of.
> 
>> Can it be done without downtime? Can it be done without modifying a running 
>> application?
> 
> Clearly the only way that we can ever transition to a new "physical
> collation" is by reindexing using a newer ICU version. And clearly
> there is going to be a need to fully deprecate any legacy version of
> ICU on a long enough timeline. There is just no getting around that.

I’m probably just going to end up rehashing the old threads I haven’t read yet…

One challenge with this approach is you have things like sort-merge joins that 
require the same collation across multiple objects. So I think you’d need to 
keep all the old indexes around until you have new indexes available for all 
objects in a database, and somehow the planner would need to be smart enough to 
dynamically figure out old vs new versions on a query-by-query basis. May need 
an atomic database-wide cutover; running a DB with internally mixed collation 
versions doesn’t seem like a small challenge. It would require enough disk 
space for two copies of all indexes, and queries would change which indexes 
they use in a way that wouldn’t be immediately obvious to users or app dev. 
Suddenly switching to or from a differently-bloated index could result in 
confusing and sudden performance changes.

Also there would still need to be a plan to address all the other non-index 
objects where collation is used, as has been mentioned before.

And given the current architecture, that final “alter database update default 
collation” command still seems awful risky, bug-prone and difficult to get 
correct. At least it seems that way to me.

At a minimum, this is a very big project and it seems to me like it may be wise 
to get more end-to-end fleshing out of the plans before committing incremental 
pieces in core (which could end up being misguided if the plan doesn’t work as 
well as assumed). Definitely doesn’t seem to me like anything that will happen 
in a year or two.

And my opinion is that the problems caused by depending on OS libraries for 
collation need to be addressed on a shorter timeline than what’s realistic for 
inventing a new way for a relational database to offer transparent or online 
upgrades of linguistic collation versions.

Also I still think folks are overcomplicating this by focusing on linguistic 
collation as the solution. Like 1% of users actually need or care about having 
the latest technically correct local-language-based sorting, at a database 
level. MySQL did the right thing here by doing what every other RDBMS did, and 
just making a simple “good-enough” collation hardcoded in the DB, same across 
all platforms, that never changes.

The 1% of users who need true linguistic collation can probably deal with the 
trade-off of dump-and-load upgrades for their ICU indexes and databases for a 
few more years.

-Jeremy

Sent from my TI-83

Re: Collation version tracking for macOS

2022-06-09 Thread Peter Geoghegan

On Wed, Jun 8, 2022 at 10:39 PM Peter Geoghegan  wrote:
> They simply REINDEX, without changing anything. The details are still
> fuzzy, but at least that's what I was thinking of.

As I said before, BCP47 format tags are incredibly forgiving by
design. So it should be reasonable to assume that anything that has
worked in an earlier version of ICU will continue to work in a way
that's at least as useful in a future version. See:

https://www.postgresql.org/message-id/CAH2-Wz=ZrA5Yf55pKtdJb2pYCVN=2dh__vgr9arqqohmqwg...@mail.gmail.com

That's not strictly guaranteed, because sometimes countries cease to
exist, and their ISO country codes eventually go away too. But that
still tends to fail gracefully. It's mostly only relevant for things
that are part of a locale, which is a broader concept than just
collation. An application that did this and relied on ICU for
localization might then find that the currency sign changed, but I'm
not aware of any impact on locales. You can ask for total nonsense
and mostly get reasonable behaviors, like Japanese as spoken in
Iceland. Even some totally made up (or misspelled) country is
accepted without complaint.

--
Peter Geoghegan

Re: Collation version tracking for macOS

2022-06-08 Thread Peter Geoghegan

On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider
 wrote:
> Even if PG supports two versions of ICU, how does someone actually go about 
> removing every dependency on the old version and replacing it with the new?

They simply REINDEX, without changing anything. The details are still
fuzzy, but at least that's what I was thinking of.

This should be possible by generalizing the definition of a collation
to recognize that different ICU versions can support the same
collation. Of course we'd also have to remember which actual ICU
version and specific "physical collation" was currently in use by each
index. We'd also probably have to have some policy about which ICU
version was the latest (or some suitably generalized version of that
that applies to collation providers more generally).

> Can it be done without downtime? Can it be done without modifying a running 
> application?

Clearly the only way that we can ever transition to a new "physical
collation" is by reindexing using a newer ICU version. And clearly
there is going to be a need to fully deprecate any legacy version of
ICU on a long enough timeline. There is just no getting around that.

The advantage of an approach along the lines that I've laid out is
that everything can be done incrementally, possibly some time after an
initial OS or Posgres upgrade, once everything has settled. Much much
later, even. If the same new ICU version isn't available in your
original/old environment (which is likely), you can avoid reindexing,
and so reserve the option of backing out of a complex upgrade until
very late in the process. You're going to have to do it eventually,
but it can probably just be an afterthought.

-- 
Peter Geoghegan

1 2 >

1 - 100 of 143 matches

Mail list logo