Re: Add CASEFOLD() function.

Jeff Davis Tue, 17 Jun 2025 11:15:17 -0700

On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:
> If the character set of <character factor> is UTF8, UTF16, or UTF32,
> then FR is replaced by
>      Case:
>          i) If the <search condition> S IS NORMALIZED evaluates to
> True, then NORMALIZE (FR)
>          ii) Otherwise, FR.


I read that as "if the input is normalized, then the output should be
normalized", IOW preserve the normalization. But does it mean "preserve
whatever the input normal form is" or "preserve NFC if the input is
NFC, otherwise the normalization is undefined"?

The above wording seems to mean "preserve NFC if the input is NFC",
because NORMALIZE(FR) with no normal form specified defaults to NFC.

> It does not appear to me that our LOWER and UPPER functions obey this
> rule,

You are correct:

   WITH s(t) AS
   (SELECT NORMALIZE(U&'\00C1\00DF\0301' COLLATE "en-US-x-icu"))
   SELECT UPPER(t) = NORMALIZE(UPPER(t)) FROM s;
    ?column? 
   ----------
    f
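
To show why that comes out false, here is a rough sketch of the same
check outside the database. It is only an illustration in Python, and it
assumes Python's str.upper() applies the same full Unicode case mapping
as ICU does for these three characters (sharp s uppercases to "SS"):

```python
import unicodedata

# Same input as the SQL query: U+00C1 (A with acute), U+00DF (sharp s),
# U+0301 (combining acute accent), normalized to NFC.
s = unicodedata.normalize("NFC", "\u00C1\u00DF\u0301")

# Sharp s has no precomposed form with an acute accent, so the input
# is already in NFC and the combining mark stays separate.
input_is_nfc = unicodedata.is_normalized("NFC", s)

# Full case mapping turns sharp s into "SS"; the combining acute now
# follows a plain "S", which does have a precomposed form (U+015A).
upper = s.upper()
upper_nfc = unicodedata.normalize("NFC", upper)

# The uppercased string is no longer in NFC, matching the "f" above.
output_is_nfc = upper == upper_nfc

print(input_is_nfc, output_is_nfc)
```

So the input is in NFC, but uppercasing exposes a new composable pair
(S followed by the combining acute), and the result is not.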

>  so there is a valid argument that we should continue to ignore it.
> Or, we can say that we have at least one of three compliant.

What do other databases do?

Given how costly normalization can be, imposing that on every caller
seems like a bit much. And favoring NFC for the user unconditionally
might not be the best thing. Then again, NFC is good most of the time,
and there are patches to speed up normalization.

I tend to think that a lot of users who want casefolding would also
want normalization, but it's hard to weigh that against the performance
cost. It might not matter outside of a few edge cases, though I'm not
sure exactly how many.

Regards,
        Jeff Davis
