Re: [HACKERS] ICU integration

Andreas Karlsson Sun, 19 Mar 2017 15:33:07 -0700

On 03/15/2017 05:33 PM, Peter Eisentraut wrote:

Updated patch attached.

Ok, I applied to patch again and ran the tests. I get a test failure onmake check-world in the pg_dump tests but it can be fixed with the below.

diff --git a/src/bin/pg_dump/t/002_pg_dump.plb/src/bin/pg_dump/t/002_pg_dump.pl

index 3cac4a9ae0..7d9c90363b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl

@@ -2422,7 +2422,7 @@ qr/^\QINSERT INTO test_fifth_table (col1, col2,col3, col4, col5) VALUES (NULL,

          'CREATE COLLATION test0 FROM "C";',
        regexp =>
          qr/^
-         \QCREATE COLLATION test0 (lc_collate = 'C', lc_ctype = 'C');\E/xm,
+         \QCREATE COLLATION test0 (provider = 'libc', locale = 'C');\E/xm,
        like => {
            binary_upgrade           => 1,
            clean                    => 1,

- I do not like how it is not obvious which is the default version of
every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
"sv_standard%icu" as one might expect. Is this inherent to ICU or
something we can work around?


We get these keywords from ucol_getKeywordValuesForLocale(), which says
"Given a key and a locale, returns an array of string values in a
preferred order that would make a difference."  So all those are
supposedly different from each other.

I believe you are mistaken. The locale "sv" is just an alias for"sv-u-standard" as far as I can tell. See the definition of the Swedishlocale(http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/coll/sv.txt)and there are just three collations: reformed (default), standard,search (plus eot and emoji which are inherited).

I am also quite annoyed at col-emoji and col-eor (European OrderingRules). They are defined at the root and inherited by all languages, butno language modifies col-emoji for their needs which makes it a footgun. See the Danish sorting example below where at least I expected thesame order. For col-eor it makes a bit more sense since I would expectthe locales en_GB-u-col-eot and sv_SE-u-col-eot to collate exactly the same.

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE"da-x-icu";

 x
----
 a
 k
 aa
(3 rows)

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE"da-u-co-emoji-x-icu";

 x
----
 a
 aa
 k
(3 rows)

It seems ICU has made quite the mess here, and I am not sure to whatdegree we need to handle it to avoid our users getting confused. I needto think some of it, and would love input from others. Maybe the rightthing to do is to ignore the issue with col-emoji, but I would want todo something about the default collations.

- ICU talks about a new syntax for locale extensions (old:
de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this page
http://userguide.icu-project.org/collation/api. Is this something we
need to car about? It looks like we currently use the old format, and
while I personally prefer it I am not sure we should rely on an old syntax.


Interesting.  I hadn't see this before, and the documentation is sparse.
 But it seems that this refers to BCP 47 language tags, which seem like
a good idea.

So what I have done is change all the predefined ICU collations to be
named after the BCP 47 scheme, with a "private use" -x-icu suffix
(instead of %icu).  The preserves the original idea but uses a standard
naming scheme.

I'm not terribly worried that they are going to remove the "old" locale
naming, but just to be forward looking, I have changed it so that the
collcollate entries are made using the "new" naming for ICU >=54.


Sounds good.

- I get an error when creating a new locale.

#CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Success

# CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Resource temporarily unavailable
Time: 1.109 ms


Hmm, that's pretty straightforward code.  What is your operating system?
 What are the build options?  Does it work without this patch?

This issue is unrelated to ICU. I had forgot to specify provider so theeorrs are correct (even though that the value of the errno is weird).

- For the collprovider is it really correct to say that 'd' is the
default value as it does in catalogs.sgml?


It doesn't say it's the default value, it says it uses the database
default.  This is all a bit confusing.  We have a collation object named
"default", which uses the locale set for the database.  That's been that
way for a while.  Now when introducing the collation providers, that
"default" collation gets its own collprovider category 'd'.  That is not
really used anywhere, but we have to put something there.

Ah, I see now. Hm, that is a bit awkward but I cannot think of a cleaneralternative.

- I do not know what the policy for formatting the documentation is, but
some of the paragraphs are in need of re-wrapping.


Hmm, I don't see anything terribly bad.

Maybe it is just me who is sensitive to wrapping, but her is an examplewhich sticks out to me with its 98 character line.


+   <para>

+ A collation object provided by <literal>libc</literal> maps to acombination+ of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>settings. (As

     the name would suggest, the main purpose of a collation is to set
     <symbol>LC_COLLATE</symbol>, which controls the sort order.  But
     it is rarely necessary in practice to have an
     <symbol>LC_CTYPE</symbol> setting that is different from
     <symbol>LC_COLLATE</symbol>, so it is more convenient to collect
     these under one concept than to create another infrastructure for
-    setting <symbol>LC_CTYPE</symbol> per expression.)  Also, a collation

+ setting <symbol>LC_CTYPE</symbol> per expression.) Also, a<literal>libc</literal> collation

     is tied to a character set encoding (see <xref linkend="multibyte">).
     The same collation name may exist for different encodings.
    </para>

- Add a hint to "ERROR:  conflicting or redundant options". The error
message is pretty unclear.


I don't see that in my patch.  Example?


It is for the below. But after some thinking I am fine not fixing it.

# CREATE COLLATION svi (LOCALE = 'sv', LC_COLLATE = 'sv', PROVIDER =  icu);
ERROR:  conflicting or redundant options

- I am not a fan of this patch comparing the encoding with a -1 literal.
How about adding -1 as a value to the enum? See the example below for a
place where the patch compares with -1.


That's existing practice.  Not a great practice, probably, but widespread.

Ok, if it is widespread in the code then let's keep using it. In a caselike this consistency is just as important.

- The patch adds "FIXME" in the below. Is that a left-over from
development or something which still needs doing?

     /*
      * Also forbid matching an any-encoding entry.  This test of course
is not
      * backed up by the unique index, but it's not a problem since we don't
-    * support adding any-encoding entries after initdb.
+    * support adding any-encoding entries after initdb. FIXME
      */


I had mentioned that upthread.  It technically needs "doing" as you say,
but it's not clear how and it's not terribly important, arguably.

The comment is no longer true since for ICU we can do that (it is not anissue though). At the very least this comment needs to be updated.


Andreas


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

Reply via email to