On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis <pg...@j-davis.com> wrote: > Character classification is not localized at all in libc or ICU as far > as I can tell.
Really? POSIX isalpha()/isalpha_l() and friends clearly depend on a locale. See eg commit d522b05c for a case where that broke something. Perhaps you mean that glibc wouldn't do that to you, because you know that, as an unstandardised detail, it sucks in (some version of) Unicode's data, which shouldn't vary between locales. But you are allowed to make your own locales, including putting whatever classifications you want into the LC_CTYPE file using POSIX-standardised tools like localedef. Perhaps that is a bit of a stretch, and no one really does that in practice, but anyway it's still "localized".

Not knowing anything about how glibc generates its charmaps, Unicode or pre-Unicode, I could take a wild guess that maybe in LATIN9 they have an old hand-crafted table, but for UTF-8 encoding it's fully outsourced to Unicode, and that's why you see a difference.

Another problem, seen in a few parts of our tree, is that we sometimes feed individual UTF-8 bytes to the isXXX() functions, which is about as well defined as trying to pay for a pint with the left half of a $10 bill.

As for ICU, it's "not localized" only if there is only one ICU library in the universe, but of course different versions of ICU might give different answers because they correspond to different versions of Unicode (as do glibc versions, FreeBSD libc versions, etc), and they might also disagree with tables built by PostgreSQL. Maybe irrelevant for now, but I think with the thus-far-imagined variants of the multi-version ICU proposal, you have to choose whether to call u_isUAlphabetic() in the library we're linked against, or via the dlsym() we look up in a particular dlopen'd library. So I guess we'd have to access it via our pg_locale_t, and again it'd be "localized" by some definitions.

Thinking about how to apply that to libc... this is going to sound far-fetched and hand-wavy, but here goes: we could even imagine a multi-version system based on different base locale paths. Instead of having newlocale(..., "en_NZ.UTF-8", ...) look in the system-provided locales under /usr/share/locale, POSIX says we're allowed to specify an absolute path, eg newlocale(..., "/foo/bar/unicode11/en_NZ.UTF-8", ...). If it is possible to use $DISTRO's localedef to compile $OLD_DISTRO's locale sources to get historical behaviour, that might provide a way to get at them without assuming the binary format is stable (it definitely isn't, but the source format is nailed down by POSIX). One fly in the ointment is that glibc failed to implement absolute path support, so you might need to use versioned locale names instead, or see if the LOCPATH environment variable can be swizzled around without confusing glibc's locale cache.

Then it wouldn't be fundamentally different from the hypothesised multi-version ICU case: you could probably come up with different isalpha_l() results for different locales because you have different LC_CTYPE versions (for example, Unicode 15.0 added new extended Cyrillic characters 1E030..1E08F; they look alphabetical to me, but what would I know). That is an extremely hypothetical pie-in-the-sky thought and I don't know if it'd really work very well, but it is a concrete way that someone might finish up getting different answers out of isalpha_l(), to observe that it really is localised. And localized.
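
To make the first point concrete, here's a minimal sketch of isalpha_l() giving different answers under different locales. The locale names are assumptions about what happens to be installed on the system ("locale -a" will tell you what's really there):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Locale names are assumptions; substitute ones from "locale -a". */
        locale_t    c_loc = newlocale(LC_CTYPE_MASK, "C", (locale_t) 0);
        locale_t    el_loc = newlocale(LC_CTYPE_MASK, "el_GR.ISO8859-7",
                                       (locale_t) 0);

        if (c_loc == (locale_t) 0 || el_loc == (locale_t) 0)
        {
            perror("newlocale");
            return 1;
        }

        /*
         * 0xE1 is GREEK SMALL LETTER ALPHA in ISO 8859-7, but not a letter
         * in the POSIX "C" locale.  Note the unsigned char: passing a
         * negative char to the isXXX() functions is undefined behaviour.
         */
        unsigned char c = 0xE1;

        printf("C: %d, el_GR: %d\n", isalpha_l(c, c_loc), isalpha_l(c, el_loc));

        freelocale(c_loc);
        freelocale(el_loc);
        return 0;
    }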
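
And here's the left-half-of-a-$10-bill problem: the bytes of one multibyte UTF-8 character, fed to isalpha_l() one at a time. Whatever comes back, it isn't a classification of the character (the locale name is again an assumption):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        locale_t    loc = newlocale(LC_CTYPE_MASK, "en_NZ.UTF-8", (locale_t) 0);

        if (loc == (locale_t) 0)
        {
            perror("newlocale");
            return 1;
        }

        /*
         * U+03B1 GREEK SMALL LETTER ALPHA is 0xCE 0xB1 in UTF-8.  Asking
         * about the bytes individually asks about "characters" that don't
         * exist in this encoding.
         */
        const unsigned char alpha[] = {0xCE, 0xB1};

        for (int i = 0; i < 2; i++)
            printf("byte 0x%02X: isalpha_l() = %d\n",
                   alpha[i], isalpha_l(alpha[i], loc));

        freelocale(loc);
        return 0;
    }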
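
For the ICU case, here's roughly what the dlopen'd flavour might look like. The soname and the versioned symbol name are assumptions: ICU builds normally append the major version number to every C symbol, but the exact name depends on how that particular ICU was configured:

    #include <dlfcn.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef int8_t UBool;       /* as in ICU's umachine.h */
    typedef int32_t UChar32;

    int
    main(void)
    {
        /* Library path and symbol suffix are assumptions, here for ICU 63. */
        void       *handle = dlopen("libicuuc.so.63", RTLD_NOW | RTLD_LOCAL);
        UBool       (*is_ualphabetic) (UChar32);

        if (!handle)
        {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        is_ualphabetic = (UBool (*) (UChar32))
            dlsym(handle, "u_isUAlphabetic_63");
        if (!is_ualphabetic)
        {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            return 1;
        }

        /*
         * U+1E030 is one of the Unicode 15.0 extended Cyrillic additions,
         * so an ICU from before 15.0 and one from after may disagree here.
         */
        printf("U+1E030 alphabetic: %d\n", is_ualphabetic(0x1E030));

        dlclose(handle);
        return 0;
    }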
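
And finally, a sketch of the far-fetched libc idea itself. Everything here is hypothetical: the paths, the locale names, and whether your libc cooperates:

    /*
     * Step 1 (shell): compile historical locale sources with today's
     * localedef, so we never depend on the binary format being stable:
     *
     *     localedef -i old_sources/en_NZ -f UTF-8 \
     *         /foo/bar/unicode11/en_NZ.UTF-8
     */
    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /*
         * Step 2: on a libc that implements the POSIX absolute-path form,
         * name the compiled locale directly.
         */
        locale_t    loc = newlocale(LC_CTYPE_MASK,
                                    "/foo/bar/unicode11/en_NZ.UTF-8",
                                    (locale_t) 0);

        /*
         * glibc doesn't support that, so there we'd have to swizzle LOCPATH
         * instead and hope its locale cache doesn't get confused.
         */
        if (loc == (locale_t) 0)
        {
            setenv("LOCPATH", "/foo/bar/unicode11", 1);
            loc = newlocale(LC_CTYPE_MASK, "en_NZ.UTF-8", (locale_t) 0);
        }

        if (loc == (locale_t) 0)
        {
            perror("newlocale");
            return 1;
        }

        /*
         * Now isalpha_l() answers according to whichever Unicode version
         * those locale sources were generated from.
         */
        printf("%d\n", isalpha_l(0x61, loc));

        freelocale(loc);
        return 0;
    }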