On 28.02.23 06:57, Jeff Davis wrote:
On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:

New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

New patch attached.

Now it always preserves the exact locale string during pg_upgrade, and
does not attempt to canonicalize it. Before it was trying to be clever
by determining if the language tag was finding the same collator as the
original string -- I didn't find a problem with that, but it just
seemed a bit too clever. So, only newly-created locales and databases
have the ICU locale string canonicalized to a language tag.

Also, I added a SQL function pg_icu_language_tag() that can convert
locale strings to language tags, and check whether they exist or not.

This patch appears to do about three things at once, and it's not clear exactly where the boundaries are between them and which ones we might actually want. And I think the terminology also gets mixed up a bit, which makes following this harder.

1. Canonicalizing the locale string. This is presumably what uloc_canonicalize() does, which the patch doesn't actually use. What are examples of what this does? Does the patch actually do this?

2. Converting the locale string to BCP 47 format. This converts 'de@collation=phonebook' to 'de-u-co-phonebk'. This is what uloc_toLanguageTag() does.

3. Validating the locale string, to reject faulty input.
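To make the distinction between items 2 and 3 concrete, here is a toy sketch in Python. It is not ICU and not what the patch does: the two lookup tables, the function name to_language_tag(), and the strict flag are all made up for illustration and cover only the one example above. It only shows that converting a keyword-style locale string to a language tag and rejecting unrecognized input are separable steps.

```python
# Toy illustration only: hand-written tables, not ICU's data or algorithm.
KEYWORDS = {"collation": "co"}      # old-style keyword -> BCP 47 key
TYPES = {"phonebook": "phonebk"}    # old-style value   -> BCP 47 type


def to_language_tag(locale: str, strict: bool = True) -> str:
    """Convert 'lang@key=value;...' into 'lang-u-key-type' form.

    With strict=True, anything the tables cannot map raises ValueError
    (the "validation" step). With strict=False, it is silently dropped.
    """
    base, _, keywords = locale.partition("@")
    tag = base.replace("_", "-")
    extensions = []
    for item in filter(None, keywords.split(";")):
        key, sep, value = item.partition("=")
        if key not in KEYWORDS or not sep:
            if strict:
                raise ValueError(f"unrecognized keyword in locale: {item!r}")
            continue  # lenient mode drops what it cannot map
        extensions.append((KEYWORDS[key], TYPES.get(value, value)))
    if extensions:
        tag += "-u-" + "-".join(f"{k}-{t}" for k, t in sorted(extensions))
    return tag
```

In this sketch, to_language_tag('de@collation=phonebook') gives 'de-u-co-phonebk', while a misspelled keyword such as 'de@colation=phonebook' either raises an error (strict) or quietly degrades to 'de' (lenient). Which of those two behaviors a real conversion has is exactly the kind of thing the questions below are asking about.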

What are the relationships between these?

I don't understand how the validation actually happens in your patch. Does uloc_toLanguageTag() do the validation also?

Can you do canonicalization without converting to language tag?

Can you do validation of un-canonicalized locale names?

What is the guidance for the use of the icu_locale_validation GUC?

The description throws in yet another term: "validates that ICU locale strings are well-formed". What is "well-formed"? How does that relate to the other concepts?

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale = 'de@collation=phonebook');
NOTICE:  00000: using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to migrate people to the "new" spellings, but the old ones aren't wrong. So this should be a separate consideration, with an option, and it would require various updates in the documentation. It also doesn't appear to address how to handle ICU before version 54.

But, see earlier questions, are these three things all connected somehow?


