On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> I don't know.  I am just pointing out what the Standard says.  I think
> we should either comply, or say that we don't do it for LOWER and UPPER
> so let's keep things implementation-consistent.

For the standard, I see two potential philosophies:

I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
preserve NFC in the same way.

II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
text value that is useful for caseless matching, but should not
ordinarily be used for display or sent to the application (those things
would be allowed, just not encouraged). For normalization, either:

  (A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
  don't require any kind of normalization; or

  (B) Follow Unicode Canonical Caseless Matching (D145), and require
  that the input and output are normalized appropriately, but leave the
  precise normal form as implementation-defined.

The current implementation could either be seen as philosophy (I) where
we've chosen to ignore the normalization part for the sake of
consistency with LOWER()/UPPER(); or it could be seen as philosophy
(II)(A).

> How much does it cost to check for NFC?  I honestly don't know the
> answer to that question, but that is the only case where we need to
> maintain normalization.

I attached a very rough patch and ran a very simple test on strings
averaging 36 bytes in length, all already in NFC and the result is also
NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
seconds, afterward about 8.

There's a patch to optimize some of the normalization paths, which I
haven't had a chance to review yet. So those numbers might come down.

> It's not unconditionally, it's only if the input was NFC.

Optimizing the case where the input is _not_ NFC seems strange to me.
If we are normalizing the output, I'd say we should just make the
output always NFC. Being more strict, this seems likely to comply with
the eventual standard.

Additionally, if we are normalizing the output, then we should also do
the input fixup for U+0345, which would make the result usable for
Canonical Caseless Matching. Again, this seems likely to comply with
the eventual standard.

So I only see two reasonable implementations:

1. The current CASEFOLD() implementation.

2. Do the input fixup for U+0345 and unconditionally normalize the
output in NFC.

If there's a case to be made for both implementations, we could also
consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
for #2. I'm not sure whether we'd want to standardize one or both of
those functions.

And if you think there's likely to be a collision with the standard
that's hard to anticipate and fix now, then we should consider
reverting CASEFOLD() for 18 and wait for more progress on the
standardization. What's the likelihood that the name changes or
something like that?

Regards,
	Jeff Davis

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 5bd1e01f7e4..12e688acec6 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -79,6 +79,7 @@
 #include "common/int.h"
 #include "common/unicode_case.h"
 #include "common/unicode_category.h"
+#include "common/unicode_norm.h"
 #include "mb/pg_wchar.h"
 #include "nodes/miscnodes.h"
 #include "parser/scansup.h"
@@ -1866,6 +1867,9 @@ str_casefold(const char *buff, size_t nbytes, Oid collid)
 		size_t dstsize;
 		char *dst;
 		size_t needed;
+		int mblen, i;
+		unsigned char *p;
+		pg_wchar *decoded;
 
 		/* first try buffer of equal size plus terminating NUL */
 		dstsize = srclen + 1;
@@ -1882,7 +1886,54 @@ str_casefold(const char *buff, size_t nbytes, Oid collid)
 		}
 
 		Assert(dst[needed] == '\0');
-		result = dst;
+
+		/* convert to pg_wchar */
+		mblen = pg_mbstrlen_with_len(dst, needed);
+		decoded = palloc((mblen + 1) * sizeof(pg_wchar));
+		p = (unsigned char *) dst;
+		for (i = 0; i < mblen; i++)
+		{
+			decoded[i] = utf8_to_unicode(p);
+			p += pg_utf_mblen(p);
+		}
+		decoded[i] = (pg_wchar) '\0';
+
+		if (unicode_is_normalized_quickcheck(UNICODE_NFC, decoded) == UNICODE_NORM_QC_YES)
+		{
+			pfree(decoded);
+			result = dst;
+		}
+		else
+		{
+			pg_wchar *normalized;
+			unsigned char *normalized_utf8;
+
+			normalized = unicode_normalize(UNICODE_NFC, decoded);
+			pfree(decoded);
+
+			/* convert back to UTF-8 string */
+			mblen = 0;
+			for (pg_wchar *wp = normalized; *wp; wp++)
+			{
+				unsigned char buf[4];
+
+				unicode_to_utf8(*wp, buf);
+				mblen += pg_utf_mblen(buf);
+			}
+
+			normalized_utf8 = palloc(mblen + 1);
+
+			p = normalized_utf8;
+			for (pg_wchar *wp = normalized; *wp; wp++)
+			{
+				unicode_to_utf8(*wp, p);
+				p += pg_utf_mblen(p);
+			}
+			*p = '\0';
+			pfree(normalized);
+
+			result = (char *) normalized_utf8;
+		}
 	}
 
 	return result;
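
For readers who want to try the conversion steps outside the backend,
here is a minimal standalone sketch of the round trip the rough patch
above wraps around normalization: decode UTF-8 into a zero-terminated
array of code points, then re-encode in two passes (measure, then
write). It assumes well-formed UTF-8 with no embedded NUL. All helper
names (utf8_seq_len, decode_string, encode_string, and so on) are
hypothetical and are not PostgreSQL functions, and the normalization
step itself is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the UTF-8 sequence starting at lead byte c. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;                       /* assumption: input is well-formed */
}

/* Hypothetical helper: decode one code point from well-formed UTF-8. */
static uint32_t utf8_decode(const unsigned char *p)
{
    switch (utf8_seq_len(p[0]))
    {
        case 1:  return p[0];
        case 2:  return ((uint32_t) (p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        case 3:  return ((uint32_t) (p[0] & 0x0F) << 12) |
                        ((uint32_t) (p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        default: return ((uint32_t) (p[0] & 0x07) << 18) |
                        ((uint32_t) (p[1] & 0x3F) << 12) |
                        ((uint32_t) (p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }
}

/* Number of bytes needed to encode cp in UTF-8. */
static size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Encode cp at out; returns the number of bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    size_t n = utf8_encoded_len(cp);

    assert(cp <= 0x10FFFF);
    switch (n)
    {
        case 1:
            out[0] = (unsigned char) cp;
            break;
        case 2:
            out[0] = (unsigned char) (0xC0 | (cp >> 6));
            out[1] = (unsigned char) (0x80 | (cp & 0x3F));
            break;
        case 3:
            out[0] = (unsigned char) (0xE0 | (cp >> 12));
            out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (cp & 0x3F));
            break;
        default:
            out[0] = (unsigned char) (0xF0 | (cp >> 18));
            out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (cp & 0x3F));
            break;
    }
    return n;
}

/* Decode nbytes of UTF-8 into a zero-terminated code point array. */
static uint32_t *decode_string(const char *src, size_t nbytes)
{
    /* worst case is one code point per input byte, plus the terminator */
    uint32_t *cps = malloc((nbytes + 1) * sizeof(uint32_t));
    const unsigned char *p = (const unsigned char *) src;
    const unsigned char *end = p + nbytes;
    size_t n = 0;

    if (cps == NULL)
        return NULL;
    while (p < end)
    {
        cps[n++] = utf8_decode(p);
        p += utf8_seq_len(*p);
    }
    cps[n] = 0;
    return cps;
}

/* Re-encode a zero-terminated code point array as a NUL-terminated string. */
static char *encode_string(const uint32_t *cps)
{
    size_t len = 0;
    unsigned char *out, *p;

    for (const uint32_t *wp = cps; *wp; wp++)   /* pass 1: measure */
        len += utf8_encoded_len(*wp);
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    p = out;
    for (const uint32_t *wp = cps; *wp; wp++)   /* pass 2: write */
        p += utf8_encode(*wp, p);
    *p = '\0';
    return (char *) out;
}
```

In the patch, the quick check and unicode_normalize() sit between the
decode and re-encode steps. The sketch only shows that the two
conversions are inverses of each other when nothing changes in between.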