On Tue, 2025-06-17 at 17:37 +0200, Vik Fearing wrote:
> If the character set of <character factor> is UTF8, UTF16, or UTF32,
> then FR is replaced by
>     Case:
>     i) If the <search condition> S IS NORMALIZED evaluates to
>        True, then NORMALIZE (FR)
>     ii) Otherwise, FR.

I read that as "if the input is normalized, then the output should be
normalized", IOW preserve the normalization. But does it mean "preserve
whatever the input normal form is" or "preserve NFC if the input is
NFC, otherwise the normalization is undefined"?

The above wording seems to mean "preserve NFC if the input is NFC",
because that's what NORMALIZE(FR) does when the normal form is
unspecified.

> It does not appear to me that our LOWER and UPPER functions obey this
> rule,

You are correct:

  WITH s(t) AS
    (SELECT NORMALIZE(U&'\00C1\00DF\0301' COLLATE "en-US-x-icu"))
  SELECT UPPER(t) = NORMALIZE(UPPER(t)) FROM s;
   ?column?
  ----------
   f

> so there is a valid argument that we should continue to ignore it.
> Or, we can say that we have at least one of three compliant.

What do other databases do?

Given how costly normalization can be, imposing that on every caller
seems like a bit much. And favoring NFC for the user unconditionally
might not be the best thing. Then again, NFC is good most of the time,
and there are patches to speed up normalization.

I tend to think that a lot of users who want casefolding would also
want normalization, but it's hard to weigh that against the performance
cost. It might not matter outside of a few edge cases, though I'm not
sure exactly how many.

Regards,
	Jeff Davis
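
For readers without an ICU-enabled build at hand, the effect in the query above can be sketched with Python's standard unicodedata module. This is only an illustration: it assumes that Python's str.upper() applies the same full case mapping (ß to SS) that the ICU collation applies for this input, and the variable names are made up for the example.

```python
import unicodedata

# Same input as the SQL example: A-acute, sharp s, combining acute accent.
# It is already in NFC because sharp s + U+0301 has no precomposed form.
s = unicodedata.normalize("NFC", "\u00C1\u00DF\u0301")

# Full case mapping expands sharp s to "SS"; the combining accent now
# follows a plain "S", which does have a precomposed form (U+015A).
upper = s.upper()

# Re-normalizing composes "S" + U+0301, so the result differs from upper.
renorm = unicodedata.normalize("NFC", upper)

print([f"U+{ord(c):04X}" for c in upper])
print([f"U+{ord(c):04X}" for c in renorm])
print(upper == renorm)
```

The uppercased string keeps the combining accent as a separate code point after the second S, whereas its NFC form composes the two, which is why the comparison in the query returns false.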