Right now, ICU locales are not validated:

  initdb ... --locale-provider=icu --icu-locale=anything
  CREATE COLLATION foo (PROVIDER=icu, LOCALE='anything');
  CREATE DATABASE anythingdb ICU_LOCALE 'anything';

all succeed. We do check that the value is accepted by ICU, but ICU
seems to accept anything and apply some fallback logic. Bogus strings
typically end up as the "root" locale (spelled "root" or "").

At first, I thought this was a bug. The ICU documentation[1] suggests
that the fallback logic can result in using the ICU default locale in
some cases. The default locale is problematic because it's affected by
the environment (LANG, LC_ALL, and strangely LC_MESSAGES; but strangely
not LC_COLLATE). Fortunately, I didn't find any cases where it actually
does fall back to the default locale, so I think we're safe, but
validation seems wise regardless.

Depending on the context, we may want to fail (e.g. initdb with a bogus
locale), warn, issue a notice that we changed the string, or just
silently rewrite what the user entered into a consistent form. BCP47 [2]
seems to be the standard here, and we're already using it when importing
the ICU collations.

ICU locale validation is not exactly straightforward, though, and I
suppose that's why it isn't already done. There's a document[3] that
explains canonicalization in terms of "level 1" and "level 2", and says
that uloc_canonicalize() provides level 2 canonicalization, but I am not
seeing all of the documented behavior in my tests. For instance, the
document says that "de__PHONEBOOK" should canonicalize to
"de@collation=phonebook", but instead I see that it remains
"de__PHONEBOOK". It also says that "C" should canonicalize to
"en_US_POSIX", but in my test, it just goes to "c".

The right entry point appears to be uloc_toLanguageTag(), which
internally calls uloc_canonicalize(), but also converts to BCP47 format,
and gives the option for strictness.

Non-strict mode seems problematic because for "de__PHONEBOOK", it
returns a langtag of plain "de", which is a different actual locale than
"de__PHONEBOOK". If uloc_canonicalize() worked as documented, it would
have changed it to "de@collation=phonebook" and the correct language tag
"de-u-co-phonebk" would be returned, which would find the right
collator. I suppose that means we would need to use strict mode.

And then we need to check whether the locale actually exists; i.e.
reject well-formed but bogus locales, like "wx-YZ". To do that, probably
the most straightforward way would be to initialize a UCollator and then
query it using ucol_getLocaleByType() with ULOC_VALID_LOCALE. If that
results in the root locale, we could say that the requested locale
doesn't exist because ICU failed to find a more suitable match (unless
the user explicitly requested the root locale). If it resolves to
something else, we could either just assume it's fine, or we could try
to validate in more detail that it matches what we expect.

To be safe, we could double-check that the resulting BCP 47 locale
string loads the same actual collator as what would have been loaded
with the original string (also check attributes?).

The overall benefit here is that we keep our catalogs consistently using
an independent standard format for ICU locale strings, rather than
whatever the user specifies. That makes it less likely that ICU needs to
use any fallback logic when trying to open a collator, which could only
be bad news.

Thoughts?

[1] https://unicode-org.github.io/icu/userguide/locale/#fallback
[2] https://en.wikipedia.org/wiki/IETF_language_tag
[3] https://unicode-org.github.io/icu/userguide/locale/#canonicalization

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS