encoding affects ICU regex character classification

Jeff Davis Wed, 29 Nov 2023 15:46:49 -0800

The following query:

    SELECT U&'\017D' ~ '[[:alpha:]]' collate "en-US-x-icu";


returns true if the server encoding is UTF8, and false if the server
encoding is LATIN9. That's a bug -- any behavior involving ICU should
be encoding-independent.

The problem seems to be confusion between a pg_wchar and a Unicode code
point in pg_wc_isalpha() and related functions.
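
To make the suspected mismatch concrete, here is a small standalone sketch in plain Python (not PostgreSQL code). It rests on two assumptions: that for a single-byte server encoding such as LATIN9 the pg_wchar value is simply the byte value, and that Python's str.isalpha() is a reasonable stand-in for ICU's alphabetic test. The helper names are made up for illustration.

```python
# Illustration only: plain Python, not PostgreSQL or ICU code.
# Assumption: for a single-byte server encoding such as LATIN9, the
# pg_wchar value of a character is simply its byte value.

def latin9_wchar(ch: str) -> int:
    """Return the single LATIN9 (ISO-8859-15) byte value for ch."""
    encoded = ch.encode("iso8859_15")
    if len(encoded) != 1:
        raise ValueError("expected exactly one LATIN9 byte")
    return encoded[0]


def classify_as_code_point(value: int) -> bool:
    """Treat value as a Unicode code point and test whether it is a letter.

    str.isalpha() stands in for ICU's classification here.
    """
    return chr(value).isalpha()


z_caron = "\u017d"                             # LATIN CAPITAL LETTER Z WITH CARON
wchar = latin9_wchar(z_caron)                  # byte 0xB4 in LATIN9
wrong = classify_as_code_point(wchar)          # U+00B4 is ACUTE ACCENT, not a letter
right = classify_as_code_point(ord(z_caron))   # U+017D is a letter

print(hex(wchar), wrong, right)
```

If those assumptions hold, passing the LATIN9 value 0xB4 to a function that expects a code point classifies U+00B4 rather than U+017D, which would match the false result seen under LATIN9.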

It might be good to introduce some infrastructure here that can convert
a pg_wchar into a Unicode code point, or decode a string of bytes into
a string of 32-bit code points. Right now, that's possible, but it
involves pg_wchar2mb() followed by encoding conversion to UTF8,
followed by decoding the UTF8 to a code point. (Is there an easier path
that I missed?)
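
The three-step path described above can be sketched as follows, again in standalone Python rather than PostgreSQL C. The function names are hypothetical, the first step is only a stand-in for pg_wchar2mb() under the assumption of a single-byte server encoding, and Python's codec machinery takes the place of the server's encoding conversion.

```python
# Illustration only: hypothetical helpers, not PostgreSQL APIs.

def decode_utf8_code_point(buf: bytes) -> int:
    """Decode the first UTF-8 sequence in buf into a Unicode code point."""
    if not buf:
        raise ValueError("empty input")
    first = buf[0]
    if first < 0x80:                 # one-byte sequence (ASCII)
        return first
    if first >> 5 == 0b110:          # two-byte sequence
        extra, cp = 1, first & 0x1F
    elif first >> 4 == 0b1110:       # three-byte sequence
        extra, cp = 2, first & 0x0F
    elif first >> 3 == 0b11110:      # four-byte sequence
        extra, cp = 3, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if len(buf) < extra + 1:
        raise ValueError("truncated UTF-8 sequence")
    for b in buf[1:1 + extra]:
        if b >> 6 != 0b10:
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


def wchar_to_code_point(wchar: int, server_encoding: str) -> int:
    """Convert a wide-character value to a code point in three steps.

    Assumes a single-byte server encoding, so the wide-character value
    maps back to exactly one byte.
    """
    raw = bytes([wchar])                                 # 1. wide char -> server-encoding bytes
    utf8 = raw.decode(server_encoding).encode("utf-8")   # 2. server encoding -> UTF-8
    return decode_utf8_code_point(utf8)                  # 3. UTF-8 -> code point


print(hex(wchar_to_code_point(0xB4, "iso8859_15")))
```

A dedicated helper would collapse these steps into a single call, which is the kind of infrastructure being proposed here.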

One wrinkle is MULE_INTERNAL, which doesn't have any conversion path to
UTF8. That's not important for ICU (because ICU is not allowed for that
encoding), but I'd like it if we could make this infrastructure
independent of ICU, because I have some follow-up proposals to simplify
character classification here and in ts_locale.c.

Thoughts?

Regards,
    Jeff Davis
