On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis <pg...@j-davis.com> wrote: > Character classification is not localized at all in libc or ICU as far > as I can tell.
Really? POSIX isalpha()/isalpha_l() and friends clearly depend on a locale. See eg commit d522b05c for a case where that broke something. Perhaps you mean that glibc wouldn't do that to you, because you know that, as an unstandardised detail, it sucks in (some version of) Unicode's data, which shouldn't vary between locales. But you are allowed to make your own locales, including putting whatever classifications you want into the LC_CTYPE file using POSIX-standardised tools like localedef. Perhaps that is a bit of a stretch, and no one really does that in practice, but anyway it's still "localized".

Not knowing anything about how glibc generates its charmaps, Unicode or pre-Unicode, I could take a wild guess that maybe in LATIN9 they have an old hand-crafted table, but for UTF-8 encoding it's fully outsourced to Unicode, and that's why you see a difference.

Another problem, seen in a few parts of our tree, is that we sometimes feed individual UTF-8 bytes to the isXXX() functions, which is about as well defined as trying to pay for a pint with the left half of a $10 bill.

As for ICU, it's "not localized" only if there is only one ICU library in the universe, but of course different versions of ICU might give different answers because they correspond to different versions of Unicode (as do glibc versions, FreeBSD libc versions, etc), and they might also disagree with tables built by PostgreSQL. Maybe irrelevant for now, but I think with the thus-far-imagined variants of the multi-version ICU proposal, you have to choose whether to call u_isUAlphabetic() in the library we're linked against, or via the dlsym() we look up in a particular dlopen'd library. So I guess we'd have to access it via our pg_locale_t, and again it'd be "localized" by some definitions.

Thinking about how to apply that to libc... this is going to sound far-fetched and hand-wavy, but here goes: we could even imagine a multi-version system based on different base locale paths. Instead of having newlocale(..., "en_NZ.UTF-8", ...) look in the system-provided locales under /usr/share/locale, POSIX says we're allowed to specify an absolute path, eg newlocale(..., "/foo/bar/unicode11/en_NZ.UTF-8", ...). If it is possible to use $DISTRO's localedef to compile $OLD_DISTRO's locale sources to get historical behaviour, that might provide a way to get at them without assuming the binary format is stable (it definitely isn't, but the source format is nailed down by POSIX). One fly in the ointment is that glibc failed to implement absolute path support, so you might need to use versioned locale names instead, or see if the LOCPATH environment variable can be swizzled around without confusing glibc's locale cache.

Then it wouldn't be fundamentally different from the hypothesised multi-version ICU case: you could probably come up with different isalpha_l() results for different locales because you have different LC_CTYPE versions (for example, Unicode 15.0 added new extended Cyrillic characters 1E030..1E08F; they look alphabetical to me, but what would I know). That is an extremely hypothetical pie-in-the-sky thought and I don't know if it'd really work very well, but it is a concrete way that someone might finish up getting different answers out of isalpha_l(), to observe that it really is localised. And localized.
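
To make the first point concrete, here's a minimal sketch of isalpha_l() giving different answers under different locales. The locale names are assumptions about what happens to be installed on the system ("locale -a" will tell you what's really there):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Locale names are assumptions; substitute ones from "locale -a". */
        locale_t    c_loc = newlocale(LC_CTYPE_MASK, "C", (locale_t) 0);
        locale_t    el_loc = newlocale(LC_CTYPE_MASK, "el_GR.ISO8859-7",
                                       (locale_t) 0);

        if (c_loc == (locale_t) 0 || el_loc == (locale_t) 0)
        {
            perror("newlocale");
            return 1;
        }

        /*
         * 0xE1 is GREEK SMALL LETTER ALPHA in ISO 8859-7, but not a letter
         * in the POSIX "C" locale.  Note the unsigned char: passing a
         * negative char to the isXXX() functions is undefined behaviour.
         */
        unsigned char c = 0xE1;

        printf("C: %d, el_GR: %d\n", isalpha_l(c, c_loc), isalpha_l(c, el_loc));

        freelocale(c_loc);
        freelocale(el_loc);
        return 0;
    }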
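
And here's the left-half-of-a-$10-bill problem: the bytes of one multibyte UTF-8 character, fed to isalpha_l() one at a time. Whatever comes back, it isn't a classification of the character (the locale name is again an assumption):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        locale_t    loc = newlocale(LC_CTYPE_MASK, "en_NZ.UTF-8", (locale_t) 0);

        if (loc == (locale_t) 0)
        {
            perror("newlocale");
            return 1;
        }

        /*
         * U+03B1 GREEK SMALL LETTER ALPHA is 0xCE 0xB1 in UTF-8.  Asking
         * about the bytes individually asks about "characters" that don't
         * exist in this encoding.
         */
        const unsigned char alpha[] = {0xCE, 0xB1};

        for (int i = 0; i < 2; i++)
            printf("byte 0x%02X: isalpha_l() = %d\n",
                   alpha[i], isalpha_l(alpha[i], loc));

        freelocale(loc);
        return 0;
    }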
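
For the ICU case, here's roughly what the dlopen'd flavour might look like. The soname and the versioned symbol name are assumptions: ICU builds normally append the major version number to every C symbol, but the exact name depends on how that particular ICU was configured:

    #include <dlfcn.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef int8_t UBool;       /* as in ICU's umachine.h */
    typedef int32_t UChar32;

    int
    main(void)
    {
        /* Library path and symbol suffix are assumptions, here for ICU 63. */
        void       *handle = dlopen("libicuuc.so.63", RTLD_NOW | RTLD_LOCAL);
        UBool       (*is_ualphabetic) (UChar32);

        if (!handle)
        {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        is_ualphabetic = (UBool (*) (UChar32))
            dlsym(handle, "u_isUAlphabetic_63");
        if (!is_ualphabetic)
        {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            return 1;
        }

        /*
         * U+1E030 is one of the Unicode 15.0 extended Cyrillic additions,
         * so an ICU from before 15.0 and one from after may disagree here.
         */
        printf("U+1E030 alphabetic: %d\n", is_ualphabetic(0x1E030));

        dlclose(handle);
        return 0;
    }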
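
And finally, a sketch of the far-fetched libc idea itself. Everything here is hypothetical: the paths, the locale names, and whether your libc cooperates:

    /*
     * Step 1 (shell): compile historical locale sources with today's
     * localedef, so we never depend on the binary format being stable:
     *
     *     localedef -i old_sources/en_NZ -f UTF-8 \
     *         /foo/bar/unicode11/en_NZ.UTF-8
     */
    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /*
         * Step 2: on a libc that implements the POSIX absolute-path form,
         * name the compiled locale directly.
         */
        locale_t    loc = newlocale(LC_CTYPE_MASK,
                                    "/foo/bar/unicode11/en_NZ.UTF-8",
                                    (locale_t) 0);

        /*
         * glibc doesn't support that, so there we'd have to swizzle LOCPATH
         * instead and hope its locale cache doesn't get confused.
         */
        if (loc == (locale_t) 0)
        {
            setenv("LOCPATH", "/foo/bar/unicode11", 1);
            loc = newlocale(LC_CTYPE_MASK, "en_NZ.UTF-8", (locale_t) 0);
        }

        if (loc == (locale_t) 0)
        {
            perror("newlocale");
            return 1;
        }

        /*
         * Now isalpha_l() answers according to whichever Unicode version
         * those locale sources were generated from.
         */
        printf("%d\n", isalpha_l(0x61, loc));

        freelocale(loc);
        return 0;
    }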