Re: Add CASEFOLD() function.

Jeff Davis Wed, 18 Jun 2025 19:53:26 -0700

On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> I don't know.  I am just pointing out what the Standard says.  I think
> we should either comply, or say that we don't do it for LOWER and UPPER
> so let's keep things implementation-consistent.


For the standard, I see two potential philosophies:

I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
preserve NFC in the same way.

II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
text value that is useful for caseless matching, but should not
ordinarily be used for display or sent to the application (those things
would be allowed, just not encouraged). For normalization, either:
  (A) Follow Unicode Default Caseless Matching (Unicode 16.0, section
3.13.5, D144), and don't require any kind of normalization; or
  (B) Follow Unicode Canonical Caseless Matching (D145), and require
that the input and output are normalized appropriately, but leave the
precise normal form as implementation-defined.
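
To make the difference between (A) and (B) concrete, here is a minimal
sketch in Python rather than C. It is not PostgreSQL code: the helper
names are made up for illustration, and Python's str.casefold() and
unicodedata.normalize() stand in for the Unicode toCasefold and NFD
operations.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def default_caseless_match(x: str, y: str) -> bool:
    # D144: compare the case-folded strings, with no normalization.
    return x.casefold() == y.casefold()


def canonical_caseless_match(x: str, y: str) -> bool:
    # D145: NFD(toCasefold(NFD(X))) == NFD(toCasefold(NFD(Y)))
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())


precomposed = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "a\u030a"  # 'a' followed by COMBINING RING ABOVE

print(default_caseless_match(precomposed, decomposed))    # False
print(canonical_caseless_match(precomposed, decomposed))  # True
```

Under (A) the two spellings of the same letter do not match unless the
caller normalizes first; under (B) they do.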


The current implementation could either be seen as philosophy (I) where
we've chosen to ignore the normalization part for the sake of
consistency with LOWER()/UPPER(); or it could be seen as philosophy
(II)(A).

> How much does it cost to check for NFC?  I honestly don't know the 
> answer to that question, but that is the only case where we need to 
> maintain normalization.

I attached a very rough patch and ran a very simple test on strings
averaging 36 bytes in length, all already in NFC, with results that are
also NFC. Before the patch, running CASEFOLD() on 10M tuples took about
3 seconds; afterward, about 8.

There's a patch to optimize some of the normalization paths, which I
haven't had a chance to review yet. So those numbers might come down. 

> 
> It's not unconditionally, it's only if the input was NFC.

Optimizing the case where the input is _not_ NFC seems strange to me.
If we are normalizing the output, I'd say we should just make the
output always NFC. Because that is the stricter behavior, it seems
likely to comply with the eventual standard.

Additionally, if we are normalizing the output, then we should also do
the input fixup for U+0345, which would make the result usable for
Canonical Caseless Matching. Again, this seems likely to comply with
the eventual standard.
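
As an illustration of why U+0345 needs input handling, here is a small
Python sketch (not PostgreSQL code; the function names are
hypothetical). It assumes that normalizing the whole input to NFD is an
acceptable stand-in for the fixup; a real implementation would
presumably only touch strings that contain U+0345 or characters that
decompose to it. The two inputs are canonically equivalent, but U+0345
folds to the starter U+03B9, so folding first freezes the combining
marks in different orders.

```python
import unicodedata


def fold_then_nfc(s: str) -> str:
    # Case fold, then normalize the output only.
    return unicodedata.normalize("NFC", s.casefold())


def fixup_fold_then_nfc(s: str) -> str:
    # Put the input in canonical order first (stand-in for the
    # U+0345 fixup), then case fold, then normalize the output.
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFC", decomposed.casefold())


# alpha + COMBINING GREEK YPOGEGRAMMENI (U+0345) + COMBINING ACUTE (U+0301),
# with the two combining marks in either order.
s1 = "\u03b1\u0345\u0301"
s2 = "\u03b1\u0301\u0345"

# Same text as far as canonical equivalence is concerned.
assert unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)

print(fold_then_nfc(s1) == fold_then_nfc(s2))              # False
print(fixup_fold_then_nfc(s1) == fixup_fold_then_nfc(s2))  # True
```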

So I only see two reasonable implementations:

1. The current CASEFOLD() implementation.

2. Do the input fixup for U+0345 and unconditionally normalize the
output to NFC.

If there's a case to be made for both implementations, we could also
consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
for #2. I'm not sure whether we'd want to standardize one or both of
those functions.
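
A rough Python sketch of how the two variants would differ from the
caller's point of view. The names casefold_v1() and casefold_v2() are
hypothetical, Python's str.casefold() stands in for the current
behavior, and full NFD of the input is assumed as a stand-in for the
U+0345 fixup. The example character U+01F0 is in NFC, but its full case
folding is the decomposed sequence U+006A U+030C, so #1 can return
non-NFC output even for NFC input.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    return unicodedata.normalize("NFC", s) == s


def casefold_v1(s: str) -> str:
    # Variant 1: plain case folding, no normalization at all.
    return s.casefold()


def casefold_v2(s: str) -> str:
    # Variant 2: fix up the input (approximated here by NFD),
    # case fold, then always return NFC.
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFC", decomposed.casefold())


s = "\u01f0"  # LATIN SMALL LETTER J WITH CARON, already in NFC

print(is_nfc(s))               # True
print(is_nfc(casefold_v1(s)))  # False: result is 'j' + COMBINING CARON
print(is_nfc(casefold_v2(s)))  # True
```

Either result works for matching against other values folded by the
same function; the difference only shows up if the folded value is
compared with, or stored alongside, text that is expected to be NFC.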

And if you think there's likely to be a collision with the standard
that's hard to anticipate and fix now, then we should consider
reverting CASEFOLD() for 18 and waiting for more progress on the
standardization. What's the likelihood that the name changes, or
something like that?

Regards,
        Jeff Davis

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 5bd1e01f7e4..12e688acec6 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -79,6 +79,7 @@
 #include "common/int.h"
 #include "common/unicode_case.h"
 #include "common/unicode_category.h"
+#include "common/unicode_norm.h"
 #include "mb/pg_wchar.h"
 #include "nodes/miscnodes.h"
 #include "parser/scansup.h"
@@ -1866,6 +1867,9 @@ str_casefold(const char *buff, size_t nbytes, Oid collid)
 		size_t		dstsize;
 		char	   *dst;
 		size_t		needed;
+		int mblen, i;
+		unsigned char *p;
+		pg_wchar   *decoded;
 
 		/* first try buffer of equal size plus terminating NUL */
 		dstsize = srclen + 1;
@@ -1882,7 +1886,54 @@ str_casefold(const char *buff, size_t nbytes, Oid collid)
 		}
 
 		Assert(dst[needed] == '\0');
-		result = dst;
+
+		/* convert to pg_wchar */
+		mblen = pg_mbstrlen_with_len(dst, needed);
+		decoded = palloc((mblen + 1) * sizeof(pg_wchar));
+		p = (unsigned char *) dst;
+		for (i = 0; i < mblen; i++)
+		{
+			decoded[i] = utf8_to_unicode(p);
+			p += pg_utf_mblen(p);
+		}
+		decoded[i] = (pg_wchar) '\0';
+
+		if (unicode_is_normalized_quickcheck(UNICODE_NFC, decoded) == UNICODE_NORM_QC_YES)
+		{
+			pfree(decoded);
+			result = dst;
+		}
+		else
+		{
+			pg_wchar *normalized;
+			unsigned char *normalized_utf8;
+
+			normalized = unicode_normalize(UNICODE_NFC, decoded);
+			pfree(decoded);
+
+			/* convert back to UTF-8 string */
+			mblen = 0;
+			for (pg_wchar *wp = normalized; *wp; wp++)
+			{
+				unsigned char buf[4];
+
+				unicode_to_utf8(*wp, buf);
+				mblen += pg_utf_mblen(buf);
+			}
+
+			normalized_utf8 = palloc(mblen + 1);
+
+			p = normalized_utf8;
+			for (pg_wchar *wp = normalized; *wp; wp++)
+			{
+				unicode_to_utf8(*wp, p);
+				p += pg_utf_mblen(p);
+			}
+			*p = '\0';
+			pfree(normalized);
+
+			result = (char *) normalized_utf8;
+		}
 	}
 
 	return result;
