On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:
> I think the code for (2) and for "I/i in Turkish" haven't returned. Given
> commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain lower(),
> the regex code likely needs a similar restoration. If not, the regex comments
> would need to change to match the code.
Great find, thank you! I'm curious how you came across this difference:
was it through testing or code inspection?
Patch attached. I also updated the top of the comment so that it's
clear that it's referring to the libc provider specifically, and that
ICU still has an issue with non-UTF8 encodings.
Also, the force-to-ASCII special case differs between
pg_wc_tolower()/pg_wc_toupper() and LOWER()/UPPER(): the former depends
only on whether it's the default locale, whereas the latter depends on
whether it's the default locale and the encoding is single-byte.
Therefore the results in the tr_TR.UTF-8 locale for the libc provider
are inconsistent:
=> select 'i' ~* 'I', 'I' ~* 'i', lower('I') = 'i', upper('i') = 'I';
 ?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
 t        | t        | f        | f
That behavior goes back a long way, so I'm not suggesting that we
change it.
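To make the two conditions concrete, here is a minimal sketch of them as
standalone predicates. This is not PostgreSQL code: the locale_model struct
and both function names are hypothetical, and the sketch models only the
conditions described above, not the actual case mapping.

```c
#include <stdbool.h>

/* Hypothetical model of the relevant locale state; not a PostgreSQL type. */
typedef struct
{
	bool		is_default;			/* is this the database default locale? */
	int			max_encoding_len;	/* 1 for single-byte encodings, 4 for UTF-8 */
} locale_model;

/*
 * Regex path (pg_wc_tolower/pg_wc_toupper with the libc provider):
 * ASCII characters get C behavior whenever the locale is the default.
 */
static bool
regex_forces_ascii(const locale_model *loc, unsigned int c)
{
	return loc->is_default && c <= 127;
}

/*
 * LOWER()/UPPER() path as described above: ASCII characters get C behavior
 * only if the locale is the default AND the encoding is single-byte.
 */
static bool
strfunc_forces_ascii(const locale_model *loc, unsigned int c)
{
	return loc->is_default && loc->max_encoding_len == 1 && c <= 127;
}
```

With is_default = true and a multibyte encoding (the tr_TR.UTF-8 case), the
first predicate holds for 'I' and the second does not, which corresponds to
the t/t/f/f row above.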
Regards,
Jeff Davis
From e8a68f42f5802d138ba04043b25b7d42862be29d Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 14 Apr 2025 11:34:11 -0700
Subject: [PATCH v1] Another unintentional behavior change in commit
e9931bfb75.
Reported-by: Noah Misch <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
src/backend/regex/regc_pg_locale.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index ed7411df83d..41b993ad773 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -21,9 +21,10 @@
#include "utils/pg_locale.h"
/*
- * To provide as much functionality as possible on a variety of platforms,
- * without going so far as to implement everything from scratch, we use
- * several implementation strategies depending on the situation:
+ * For the libc provider, to provide as much functionality as possible on a
+ * variety of platforms without going so far as to implement everything from
+ * scratch, we use several implementation strategies depending on the
+ * situation:
*
* 1. In C/POSIX collations, we use hard-wired code. We can't depend on
* the <ctype.h> functions since those will obey LC_CTYPE. Note that these
@@ -33,8 +34,9 @@
*
* 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
* This assumes that every platform uses Unicode codepoints directly
- * as the wchar_t representation of Unicode. On some platforms
- * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
+ * as the wchar_t representation of Unicode. (XXX: This could be a problem
+ * for ICU in non-UTF8 encodings.) On some platforms wchar_t is only 16 bits
+ * wide, so we have to punt for codepoints > 0xFFFF.
*
* 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
* values up to 255, and punt for values above that. This is 100% correct
@@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
case PG_REGEX_STRATEGY_BUILTIN:
return unicode_uppercase_simple(c);
case PG_REGEX_STRATEGY_LIBC_WIDE:
+ /* force C behavior for ASCII characters, per comments above */
+ if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+ return pg_ascii_toupper((unsigned char) c);
if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
return towupper_l((wint_t) c, pg_regex_locale->info.lt);
/* FALL THRU */
case PG_REGEX_STRATEGY_LIBC_1BYTE:
+ /* force C behavior for ASCII characters, per comments above */
+ if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+ return pg_ascii_toupper((unsigned char) c);
if (c <= (pg_wchar) UCHAR_MAX)
return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
return c;
@@ -590,10 +598,16 @@ pg_wc_tolower(pg_wchar c)
case PG_REGEX_STRATEGY_BUILTIN:
return unicode_lowercase_simple(c);
case PG_REGEX_STRATEGY_LIBC_WIDE:
+ /* force C behavior for ASCII characters, per comments above */
+ if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+ return pg_ascii_tolower((unsigned char) c);
if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
return towlower_l((wint_t) c, pg_regex_locale->info.lt);
/* FALL THRU */
case PG_REGEX_STRATEGY_LIBC_1BYTE:
+ /* force C behavior for ASCII characters, per comments above */
+ if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+ return pg_ascii_tolower((unsigned char) c);
if (c <= (pg_wchar) UCHAR_MAX)
return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
return c;
--
2.34.1