On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
>  provider | isalpha | isdigit
> ----------+---------+---------
>  ICU      | f       | t
>  glibc    | t       | f
>  builtin  | f       | f
The "ICU" above is really the behavior of the Postgres ICU provider as
we implemented it; it's not something forced on us by ICU.

For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and
pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are
defined by ICU to be equivalent to java.lang.Character.isLetter() and
java.lang.Character.isDigit().

ICU documents[3] how regex character classes should be implemented
using the ICU APIs, and cites Unicode TR#18 [4] as the source. Despite
being under the heading "...for C/POSIX character classes...", [3] says
it's based on the "Standard" variant of [4], rather than "POSIX
Compatible".

(Aside: the Postgres ICU provider doesn't match what [3] suggests for
the "alpha" class. For the character U+FF11 it doesn't matter, but I
suspect there are differences for other characters. This should be
fixed.)

The differences between PG_C_UTF8 and what ICU suggests are just
because the former uses the "POSIX Compatible" definitions and the
latter uses "Standard".

I implemented both the "Standard" and "POSIX Compatible" compatibility
properties in ad49994538, so it would be easy to change what PG_C_UTF8
uses.

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4] http://www.unicode.org/reports/tr18/#Compatibility_Properties

> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is
> not digit, but it's alpha)?

Yes, some differences are to be expected.

But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as
we can point to something other than "glibc version 2.35 does it this
way".

Regards,
	Jeff Davis
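
For anyone who wants to check U+FF11 against the two TR#18 digit
definitions by hand, here is a minimal Python sketch. It is an
illustration only, not the Postgres or ICU code. The function names are
hypothetical, and it assumes that the "Standard" digit class means
general category Nd (Decimal_Number) and that "POSIX Compatible" means
ASCII 0-9 only, as listed in the compatibility properties table of [4].
The letter check uses general category L*, which should correspond to
u_isalpha() but is not the same thing as the Unicode Alphabetic
property.

```python
import unicodedata


def isdigit_standard(ch: str) -> bool:
    """Assumed TR#18 "Standard" digit: general category Nd."""
    return unicodedata.category(ch) == "Nd"


def isdigit_posix_compatible(ch: str) -> bool:
    """Assumed TR#18 "POSIX Compatible" digit: ASCII 0-9 only."""
    return "0" <= ch <= "9"


def is_letter_category(ch: str) -> bool:
    """General category L* (letters); not the Alphabetic property."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    ch = "\uff11"  # FULLWIDTH DIGIT ONE
    print(unicodedata.name(ch))
    print("digit (Standard):        ", isdigit_standard(ch))
    print("digit (POSIX Compatible):", isdigit_posix_compatible(ch))
    print("letter category:         ", is_letter_category(ch))
```

Under those assumptions, U+FF11 is a digit only in the "Standard"
variant and is not a letter in either, which lines up with the "ICU"
and "builtin" rows of the table quoted at the top.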