Re: ICU locale validation / canonicalization

Jeff Davis Thu, 09 Feb 2023 13:16:05 -0800

On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:
> One use case is that if a user specifies a locale, say, of 'de-AT',
> this 
> might canonicalize to 'de' today,


Canonicalization should not lose useful information, it should just
rearrange it, so I don't see a risk here based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

> but we should still store what the 
> user specified because 1) that documents what the user wanted, and 2)
> it 
> might not canonicalize to the same thing tomorrow.

We don't want to store things with ambiguous interpretations that could
change tomorrow; that's a recipe for trouble. That's why most people
store timestamps as the offset from some epoch in UTC rather than as
"2/9/23" (Feb 9 or Sept 2? 1923 or 2023?). There are exceptions where
you would want to store something like that, but I don't see why they'd
apply in this case, where reinterpretation probably means a corrupted
index.

If the user wants to know how their ad-hoc string was interpreted, they
can look at the resulting BCP 47 language tag, and see if it's what
they meant. We can try to make this user-friendly by offering a NOTICE,
WARNING, or helper functions that allow them to explore. We can also
double check that the canonicalized form resolves to the same actual
collator to be safe, and maybe even fall back to whatever the user
specified if not. I'm open to discuss how strict we want to be and what
kind of escape hatches we need to offer.

There is still a risk that the BCP 47 language tag resolves to a
different specific ICU collator or different collator version tomorrow.
That's why we need to be careful about versioning (library versions or
collator versions or both), and we've had long discussions about that.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS

Re: ICU locale validation / canonicalization

Reply via email to