On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:
> One use case is that if a user specifies a locale, say, of 'de-AT',
> this might canonicalize to 'de' today,

Canonicalization should not lose useful information; it should just
rearrange it. So I don't see a risk here, based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

> but we should still store what the user specified because 1) that
> documents what the user wanted, and 2) it might not canonicalize to
> the same thing tomorrow.

We don't want to store things with ambiguous interpretations that
could change tomorrow; that's a recipe for trouble. That's why most
people store timestamps as the offset from some epoch in UTC rather
than as "2/9/23" (Feb 9 or Sept 2? 1923 or 2023?). There are
exceptions where you would want to store something like that, but I
don't see why they'd apply in this case, where reinterpretation
probably means a corrupted index.

If the user wants to know how their ad-hoc string was interpreted,
they can look at the resulting BCP 47 language tag and see if it's
what they meant. We can try to make this user-friendly by offering a
NOTICE, a WARNING, or helper functions that allow them to explore.

We can also double-check that the canonicalized form resolves to the
same actual collator to be safe, and maybe even fall back to whatever
the user specified if not. I'm open to discussing how strict we want
to be and what kind of escape hatches we need to offer.

There is still a risk that the BCP 47 language tag resolves to a
different specific ICU collator or a different collator version
tomorrow. That's why we need to be careful about versioning (library
versions or collator versions or both), and we've had long
discussions about that.

--
Jeff Davis
PostgreSQL Contributor Team - AWS
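
To illustrate the timestamp analogy in the message above, here is a
minimal Python sketch. It is not part of any patch; the helper names
(interpret, from_epoch) and the two format strings are assumptions
chosen for illustration. It shows the same ad-hoc string "2/9/23"
resolving to two different instants depending on the reading, while
the stored epoch offset has only one reading.

```python
from datetime import datetime, timezone

# The ad-hoc string from the message; its meaning depends on the reader.
AMBIGUOUS = "2/9/23"


def interpret(text: str, fmt: str) -> datetime:
    """Parse text under one explicit format and pin the result to UTC."""
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def from_epoch(seconds: int) -> datetime:
    """Recover the instant from a stored epoch offset; no guessing needed."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Two plausible readings of the same stored string.
as_month_first = interpret(AMBIGUOUS, "%m/%d/%y")  # February 9
as_day_first = interpret(AMBIGUOUS, "%d/%m/%y")    # September 2

# The "canonical" stored form: seconds since the Unix epoch, in UTC.
epoch_month_first = int(as_month_first.timestamp())
epoch_day_first = int(as_day_first.timestamp())

if __name__ == "__main__":
    print("month-first reading:", as_month_first.date(), epoch_month_first)
    print("day-first reading:  ", as_day_first.date(), epoch_day_first)
    print("round trip is exact:", from_epoch(epoch_month_first) == as_month_first)
```

The point of the sketch is only that the interpretation step happens
once, at input time, and what gets stored afterwards cannot be re-read
differently later. Whether the same holds for a given ICU locale
string is exactly what the collator check discussed above would have
to confirm.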