On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote: > Here is a patch I've been working on to allow the use of ICU for sorting > and other locale things.
This is very interesting work, and it's great to see some development in this area. I've been peripherally involved in more than one collation-change-broke-my-data incident over the years. I took the patch for a quick spin today. Here are a couple of initial observations. This patch adds pkg-config support to our configure script, in order to produce the build options for ICU. That's cool, and I'm a fan of pkg-config, but it's an extra dependency that I just wanted to highlight. For example MacOSX appears to ship with ICU but has is no pkg-config or ICU .pc files out of the box (erm, I can't even find the headers, so maybe that copy of ICU is useless to everyone except Apple). The MacPorts ICU port ships with .pc files, so that's easy to deal with, and I don't expect it to be difficult to get a working pkg-config and meta-data installed alongside ICU on any platform that doesn't already ship with them. It may well be useful for configuring other packages. (There is also other cool stuff that would become possible once ICU is optionally around, off topic.) There is something missing from the configure script however: the output of pkg-config is not making it into CFLAGS or LDFLAGS, so building fails on FreeBSD and MacOSX where for example <unicode/ucnv.h> doesn't live in the default search path. I tried very briefly to work out what but autoconfuse defeated me. You call the built-in strcoll/strxfrm collation provider "posix", and although POSIX does define those functions, it only does so because it inherits them from ISO C90. As far as I can tell Windows provides those functions because it conforms to the C spec, not the POSIX spec, though I suppose you could argue that in that respect it DOES conform to the POSIX spec... Also, we already have a collation called "POSIX". Maybe we should avoid confusing people with a "posix" provider and a "POSIX" collation? But then I'm not sure what else to call it, but perhaps "system" as Petr Jelinek suggested[1], or "libc". postgres=# select * from pg_collation where collname like 'en_NZ%'; ┌──────────────────┬───────────────┬───────────┬──────────────┬──────────────┬──────────────────┬──────────────────┬─────────────┐ │ collname │ collnamespace │ collowner │ collprovider │ collencoding │ collcollate │ collctype │ collversion │ ├──────────────────┼───────────────┼───────────┼──────────────┼──────────────┼──────────────────┼──────────────────┼─────────────┤ │ en_NZ │ 11 │ 10 │ p │ 6 │ en_NZ │ en_NZ │ 0 │ │ en_NZ │ 11 │ 10 │ p │ 8 │ en_NZ.ISO8859-1 │ en_NZ.ISO8859-1 │ 0 │ │ en_NZ │ 11 │ 10 │ p │ 16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │ 0 │ │ en_NZ.ISO8859-1 │ 11 │ 10 │ p │ 8 │ en_NZ.ISO8859-1 │ en_NZ.ISO8859-1 │ 0 │ │ en_NZ.ISO8859-15 │ 11 │ 10 │ p │ 16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │ 0 │ │ en_NZ.UTF-8 │ 11 │ 10 │ p │ 6 │ en_NZ.UTF-8 │ en_NZ.UTF-8 │ 0 │ │ en_NZ%icu │ 11 │ 10 │ i │ -1 │ en_NZ │ en_NZ │ -1724383232 │ └──────────────────┴───────────────┴───────────┴──────────────┴──────────────┴──────────────────┴──────────────────┴─────────────┘ (7 rows) I notice that you use encoding -1, meaning "all". I haven't fully grokked what that really means but it appears that we won't be able to create any new such collations after initdb using CREATE COLLATION, if for example you update your ICU installation and now have a new collation available. When I tried dropping some and recreating them they finished up with collencoding = 6. Should the initdb-created rows use 6 too? + ucol_getVersion(collator, versioninfo); + numversion = ntohl(*((uint32 *) versioninfo)); + + if (numversion != collform->collversion) + ereport(WARNING, + (errmsg("ICU collator version mismatch"), + errdetail("The database was created using version 0x%08X, the library provides version 0x%08X.", + (uint32) collform->collversion, (uint32) numversion), + errhint("Rebuild affected indexes, or build PostgreSQL with the right version of ICU."))); I played around with bumping version numbers artificially. That gives each session that accesses the collation one warning: postgres=# select * from foo order by id; WARNING: ICU collator version mismatch DETAIL: The database was created using version 0x99380001, the library provides version 0x99380000. HINT: Rebuild affected indexes, or build PostgreSQL with the right version of ICU. ┌────┐ │ id │ ├────┤ └────┘ (0 rows) postgres=# select * from foo order by id; ┌────┐ │ id │ ├────┤ └────┘ (0 rows) The warning persists even after I reindex all affected tables (and start a new backend), because you're only recording the collation at pg_collation level and then only setting it at initdb time, so that HINT isn't much use for clearing the warning. I think you'd need to record a snapshot of the version used for each collation that was used on *each index*, and update it whenever you CREATE INDEX or REINDEX etc. There can of course be more than one collation and thus version involved, if you have columns with different collations, so I guess you'd need an column to hold an array of version snapshots whose order corresponds to pg_index.indcollation's. I contemplated something like that for the built-in libc collations, based on extensions or external scripts that would know how to checksum the collation definition files for your particular operating system[2] and track them per index (and per column) in a user table. That's not material for a core feature but it did cause me to think about this problem a little bit. That was based on the idea that you'd run a script periodically or at least after OS upgrades, and either have it run index automatically or email you a set of command to run off hours. Obviously you can't refuse to come up on mismatch, or you'll never be able to REINDEX. Automatically launched unscheduled REINDEXes in core would be antisocial and never fly. But a warning could easily go unnoticed leading to corruption (not only reads, but also UNIQUE CONSTRAINTs not enforced, yada yada yada). I wonder what else we could reasonably do... A couple of thoughts about abbreviated keys: #ifndef TRUST_STRXFRM if (!collate_c) abbreviate = false; #endif I think this macro should affect only strxfrm, and we should trust ucol_getSortKey or disable it independently. ICU's manual says reassuring things like "Sort keys are most useful in databases" and "Sort keys are generally only useful in databases or other circumstances where function calls are extremely expensive". It looks like varstr_abbrev_convert calls strxfrm unconditionally (assuming TRUST_STRXFRM is defined). <captain-obvious>This needs to use ucol_getSortKey instead when appropriate.</> It looks like it's a bit more helpful than strxfrm about telling you the output buffer size it wants, and it doesn't need nul termination, which is nice. Unfortunately it is like strxfrm in that the output buffer's contents is unspecified if it ran out of space. +++ b/src/test/regress/expected/collate.icu.out @@ -0,0 +1,1076 @@ +/* + * This test is for Linux/glibc systems and assumes that a full set of + * locales is installed. It must be run in a database with UTF-8 encoding, + * because other encodings don't support all the characters used. + */ Should say ICU. [1] https://www.postgresql.org/message-id/3113bd83-9b3a-a95b-cf24-4b5b1dc6666f%402ndquadrant.com [2] https://github.com/macdice/check_pg_collations -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers