ICU locale validation / canonicalization

Jeff Davis Tue, 07 Feb 2023 23:59:45 -0800


Right now, ICU locales are not validated:


  initdb ... --locale-provider=icu --icu-locale=anything
  CREATE COLLATION foo (PROVIDER=icu, LOCALE='anything');
  CREATE DATABASE anythingdb ICU_LOCALE 'anything';

all succeed.

We do check that the value is accepted by ICU, but ICU seems to accept
anything and use some fallback logic. Bogus strings will typically end
up as the "root" locale (spelled "root" or "").

At first, I thought this was a bug. The ICU documentation[1] suggests
that the fallback logic can result in using the ICU default locale in
some cases. The default locale is problematic because it's affected by
the environment (LANG, LC_ALL, and, strangely, LC_MESSAGES, but not
LC_COLLATE).

Fortunately, I didn't find any cases where it actually does fall back
to the default locale, so I think we're safe, but validation seems wise
regardless. In different contexts we may want to fail (e.g. initdb
with a bogus locale), warn, issue a notice that we changed the
string, or just silently change what the user entered to be in a
consistent form. BCP 47 [2] seems to be the standard here, and we're
already using it when importing the ICU collations.

ICU locale validation is not exactly straightforward, though, and I
suppose that's why it isn't already done. There's a document[3] that
explains canonicalization in terms of "level 1" and "level 2", and says
that uloc_canonicalize() provides level 2 canonicalization, but I am
not seeing all of the documented behavior in my tests. For instance,
the document says that "de__PHONEBOOK" should canonicalize to
"de@collation=phonebook", but instead I see that it remains
"de__PHONEBOOK". It also says that "C" should canonicalize to
"en_US_POSIX", but in my test, it just goes to "c".

The right entry point appears to be uloc_toLanguageTag(), which
internally calls uloc_canonicalize(), but also converts to BCP 47 format,
and gives the option for strictness. Non-strict mode seems problematic
because for "de__PHONEBOOK", it returns a langtag of plain "de", which
is a different actual locale than "de__PHONEBOOK". If uloc_canonicalize
worked as documented, it would have changed it to
"de@collation=phonebook" and the correct language tag "de-u-co-phonebk"
would be returned, which would find the right collator. I suppose that
means we would need to use strict mode.

And then we need to check whether it actually exists; i.e. reject well-
formed but bogus locales, like "wx-YZ". To do that, probably the most
straightforward way would be to initialize a UCollator and then query
it using ucol_getLocaleByType() with ULOC_VALID_LOCALE. If that results
in the root locale, we could say that it doesn't exist because it
failed to find a more suitable match (unless the user explicitly
requested the root locale). If it resolves to something else, we could
either just assume it's fine, or we could try to validate that it
matches what we expect in more detail. To be safe, we could double-
check that the resulting BCP 47 locale string loads the same actual
collator as what would have been loaded with the original string (also
check attributes?).
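
The root-locale test described above can be sketched as a small decision
function. This is only a sketch under stated assumptions: the names
is_root_locale() and resolved_locale_is_acceptable() are hypothetical, and
the valid_locale argument stands in for the string that
ucol_getLocaleByType() would report with ULOC_VALID_LOCALE for a collator
opened on the requested locale. No ICU calls are made here. Treating "und"
as a spelling of root is also an assumption, and a root locale carrying
keywords or attributes is not handled.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true if "loc" is a bare spelling of the root locale.
 * "root" and "" are the spellings mentioned above; accepting the BCP 47
 * form "und" is an assumption of this sketch.
 */
static bool
is_root_locale(const char *loc)
{
    return loc[0] == '\0' ||
           strcmp(loc, "root") == 0 ||
           strcmp(loc, "und") == 0;
}

/*
 * Hypothetical check: "requested" is the string the user supplied, and
 * "valid_locale" is what a collator opened on it would report as its
 * valid locale. Returns false when the collator resolved to root even
 * though the user did not ask for root, i.e. no better match was found.
 */
static bool
resolved_locale_is_acceptable(const char *requested, const char *valid_locale)
{
    if (is_root_locale(valid_locale) && !is_root_locale(requested))
        return false;
    return true;
}
```

With this, a request for "wx-YZ" that resolves to "root" would be rejected,
while an explicit request for "root" would still pass. Whether a rejection
becomes an error, a warning, or a notice would depend on the context.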

The overall benefit here is that we keep our catalogs consistently
using an independent standard format for ICU locale strings, rather
than whatever the user specifies. That makes it less likely that ICU
needs to use any fallback logic when trying to open a collator, which
could only be bad news.

Thoughts?


[1] https://unicode-org.github.io/icu/userguide/locale/#fallback
[2] https://en.wikipedia.org/wiki/IETF_language_tag
[3] https://unicode-org.github.io/icu/userguide/locale/#canonicalization


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS
