On Tue, Nov 22, 2022 at 7:34 PM Jeff Davis <pg...@j-davis.com> wrote: > On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote: > > Problem 2: If ICU 67 ever decides to report a different version for > > a > > given collation (would it ever do that? I don't expect so, but ...), > > we'd be unable to open the collation with the search-by-collversion > > design, and potentially the database. What is a user supposed to do > > then? Presumably our error/hint for that would be "please insert the > > correct ICU library into drive A", but now there is no correct > > library > > Let's say that Postgres is compiled against version 67.X, and the > sysadmin upgrades the ICU package to 67.Y, which reports a different > collation version for some locale. > > Your current patch makes this impossible for the administrator to fix, > because there's no way to have two different libraries loaded with the > same major version number, so it will always pick the compiled-in ICU. > The user will be forced to accept the new version of the collation, see > WARNINGs in their logs, and possibly corrupt their indexes.
They could probably also 'pin' the older minor version package using their package manager (= downgrade) until they're ready to upgrade and use REFRESH VERSION to certify that they've rebuilt everything relevant or are OK with risks. Not pretty I admit, but I think the end result is about the same for search-for-collversion, because I imagine that (1) the default behaviour on failure to search would likely be to use the linked library instead and WARN about [dat]collversion mismatch, so far the same, and (2) the set of people who would really be prepared to compile their own copy of 67.X instead of downgrading or REFRESHing (with or without rebuilding) is vanishingly small. Two questions I wondered about: 1. *Do* they change ucol_getVersion() values in minor releases? I tried to find a written policy on that. https://icu.unicode.org/processes is not encouraging: it gives the example of a "third digit in an official release number" [changing] because a CLDR change was incorporated. Hrmph. But that's clearly not even the modern ICU versioning system (it made a change a bit like ours in 49, making the first number only major, so maybe that "third" number is now the second number, AKA minor version), and also that's a CLDR minor version change; is CLDR minor even in the recipe for ucol_getVersion()? Even without data changes, I guess that bug fixes could apply to the UCA logic, and I assume that UCA logic is included in it. Hmm. A non-hypothetical example of a CLDR change within an ICU major version that I've been able to find is: https://cldr.unicode.org/index/downloads/cldr-38 Here we see that CLDR had a minor version bump 38 -> 38.1, "a very small number of incremental additions to version 38 to address the specific bugs listed in Δ38.1", and was included in ICU 68.2. Being a minor ICU release 68.1 -> 68.2, perhaps you could finish up running that just with a regular upgrade on typical distros (not a major OS upgrade), and since PostgreSQL would normally be linked against eg .68, not .68.1, it'd start using it at the next cluster start when that symlink is updated to point to .68.2. As it happens, if you follow the documentation links to see what actually changed in that particular pair of CLDR+ICU minor releases, it's timezones and locale stuff other than collations, so wouldn't affect us. Can we find a chapter and verse that says that ICU would only ever move to a new CLDR in a minor release, and CLDR would never change order of pre-existing code points in a minor release? It might be interesting to see if https://github.com/unicode-org/icu/tree/release-68-1 and https://github.com/unicode-org/icu/tree/release-68-2 report a different ucol_getVersion() for any locale, but not conclusive if it doesn't; it might be because something in the version pipeline knew that particular CLDR change didn't affect collators... This speculation feels pretty useless. Maybe we should go and read the code or ask an ICU expert, but I'm not against making it theoretically possible to access two different minor versions at once, just to cover all the bases for future-proofing. 2. Would package managers ever allow two minor versions to be installed at once? I highly doubt it; they're probably more interested in ABI stability so that dependent packages work when bugfixes are shipped, and that's certainly nailed down at the major version level. It'd probably be a case of having to compile it yourself, which seems unlikely to me in the real world. That's why I left minor version out of earlier patches, but I'm OK with changing that. As for how, I think that depends on our modelling decision (see below). > Search-by-collversion would still be frustrating for the admin, but at > least it would be possible to fix by compiling their own 67.X and > asking Postgres to search that library, too. We could make it slightly > more friendly by having an error that reports the libraries searched > and the collation versions found, if none of the versions match. We can > have a GUC that controls whether a failure to find the right version is > a WARNING or an ERROR. Good ideas. > I tried to write up some docs. It's hard to explain why we are exposing > to the user the collation version and the library version in these > different ways, and what effects they have. Always a good test: see how crazy it sounds when translated to user speak. > The current patch feels like it hasn't decided whether the collation > version is ucol_getVersion() (collversion) or u_getVersion() (library > version). The collversion is more prominent in the UI (with its own > syntax), yet it's just a cross-check for whether to issue a WARNING or > not; while the library version is hidden in the locale field and it > actually decides which symbol is called. Yeah. I agree that it sucks to have two kinds of versions flying around in the user's mind. > > Yeah. I just don't like the way it *appears* to be doing something > > clever, but > > it doesn't solve any fundamental problem at all because the > > collversion > > information is under human control and so it's really doing something > > stupid. > > I assume by "human control" you mean "ALTER COLLATION ... REFRESH > VERSION". I agree that relying on the admin's declaration is dubious, > especially when we provide no good advice on how to actually do that > safely. > > But I don't see what using the library version instead buys us here, > except that library version is part of the LOCALE, and there's no ALTER > command for that. You could just as easily deprecate/eliminate the > ALTER COLLATION REFRESH VERSION, and then say that the collversion is > out of human control, too. > > By introducing multiple libraries, I think we need to change that > syntax anyway, to be something like: > > ALTER COLLATION ... SET VERSION TO '...' > > or even: > > ALTER COLLATION ... FORCE VERSION TO '...' OK. Time for a new list of the various models we've discussed so far: 1. search-by-collversion: We introduce no new "library version" concept to COLLATION and DATABASE object and little or no new syntax. Whenever opening a collation or database, the system will search some candidate list of ICU libraries to try to find the one that agrees with [dat]collversion. When creating a new collation or database, the system will select one (probably the linked one unless you override somehow) and record ucol_getVersion() in [dat]collversion. When searching, it might fail to find a suitable library and ereport; to fix that, it is the admin's job to somehow expand the set of candidate libraries. In such a failure case, perhaps it would fall back to using some default library version (probably the one that is linked, overridable by GUC?), with a WARNING (unless you turned on ERRORs), and if you want to shut it up without supplying the right candidate library, you can still fall back to the REFRESH VERSION hammer (or maybe that should indeed called FORCE to make it clearer that it's not a harmless operation where the system holds your hand, you're actually certifying that you have rebuilt indexes and you know what you're doing). The set of candidate versions could perhaps be provided with extra_icu_library_versions=63,71 OR =63.1,63.2 strings, at least on Unix systems following the traditional symlink conventions. Remembering that a typical Unixoid system should have libraries and symlinks like: libicui18n.a libicui18n.so -> libicui18n.so.71.1 libicui18n.so.63 -> libicui18n.so.63.1 libicui18n.so.63.1 libicui18n.so.67 -> libicui18n.so.67.1 libicui18n.so.67.1 libicui18n.so.71 -> libicui18n.so.71.1 libicui18n.so.71.1 The reason I prefer major[.minor] strings over whole library names is that we need to dlopen two of them so it's a little easier to build them from those parts than have to supply both names. The reason I prefer to keep allowing major-only versions to be listed is that it's good to have the option to just follow minor upgrades automatically. Or I guess you could make something that can automatically search a whole directory (which directory?) to find all the suitably named libraries so you don't ever have to mention versions manually (if you want "apt-get install libicu72" to be enough with no GUC change needed) -- is that too weird? Perhaps we could write functions that can show you the available versions to demystify the searching mechanism slightly and show how various numbers relate, something like (warning: I made up numbers for illustration, they are wrong!): SELECT * FROM pg_available_icu_libraries() icu_version unicode_version uca_version cldr_version 67.1 14.0 3.1 38.0 71.1 15.0 4.0 42.0 SELECT * FROM pg_available_icu_collation_versions('en') icu_version collation_version 67.1 142.42 71.1 153.112 2. lib-version-in-providers: We introduce a separate provider value for each ICU version, for example ICU63, plus an unversioned ICU like today. The collversion column is used only for warnings. Warnings are expected when you used the unversioned ICU provider and upgrade to a binary linked to a later library. You can clear the warnings by doing ALTER COLLATION/DATABASE SET [LOCALE_]PROVIDER = ICU63, or with the REFRESH VERSION hammer. Not sure how you fit minor versions into that, if we want to support those. Maybe ICU means "whatever is linked", ICU63 means "whatever libicui18n.so.63 points to" and ICU63_1 means libicu18n.so.63.1, something like that, so the user can choose from three levels of specificity. 3. lib-version-in-attributes: We introduce daticuversion (alongside datcollversion) and collicuversion (alongside collversion). Similar to the above, but it's a separate property and the provider is always ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change ICU_VERSION. 4. lib-version-in-locale: "63:en" from earlier versions. That was mostly a strawman proposal to avoid getting bogged down in syntax/catalogue/model change discussions while trying to prove that dlopen would even work. It doesn't sound like anyone really likes this. 5. lib-version-in-collversion: We didn't explicitly discuss this before, but you hinted at it: we could just use u_getVersion() in [dat]collversion. I haven't analysed this much but I don't think it has a very nice upgrade path from PG15, and it forces you to decide whether to store just the major version and not even notice when the (unstored) minor version changes, or store major.minor and complain/break down when routine minor upgrades happen. It is a logical possibility though, once you decide you only want one kind of version in the system. I'm willing to update the patch to try one of these out so we can kick the tyres some more, but I'll wait to see if we can get some consensus on the way forward. Despite my initial reactions, I'm willing to try out the search-by-collversion concept if others are keen on it. The example I worked through in the first paragraph of this email helped me warm to it a little, and with the observability functions I showed you might have a chance of figuring out what's going on in some edge cases. Any other ideas, or votes for these ideas?