On Sat, Dec 2, 2023 at 9:49 AM Jeff Davis <pg...@j-davis.com> wrote:
> Your definition is too wide in my opinion, because it mixes together
> different sources of variation that are best left separate:
> a. region/language
> b. technical requirements
> c. versioning
> d. implementation variance
>
> (a) is not a true source of variation (please correct me if I'm wrong)
>
> (b) is perhaps interesting. The "C" locale is one example, and perhaps
> there are others, but I doubt very many others that we want to support.
>
> (c) is not a major concern in my opinion. The impact of Unicode changes
> is usually not dramatic, and it only affects regexes so it's much more
> contained than collation, for example. And if you really care, just use
> the "C" locale.
>
> (d) is mostly a bug
I get you. I was mainly commenting on what the POSIX APIs allow, which
is much wider than what you might observe on <your local libc>, and
also end-user-customisable. But I agree that Unicode is all-pervasive
and authoritative in practice, to the point that if your libc disagrees
with it, it's probably just wrong. (I guess site-local locales were
essential for bootstrapping in the early days of computing in a new
language/territory, but I can't find much discussion of those tools
being used by anyone other than libc maintainers today.)

> I think we only need 2 main character classification schemes: "C" and
> Unicode (TR #18 Compatibility Properties[1], either the "Standard"
> variant or the "POSIX Compatible" variant or both). The libc and ICU
> ones should be there only for compatibility and discouraged and
> hopefully eventually removed.

How would you specify what you want? As with collating, I like the
idea of keeping support for libc even if it is terrible (some libcs
more than others) and eventually not the default, because I think
optional agreement with other software on the same host is a feature.

In the regex code we see not only class membership tests, eg
iswlower_l(), but also conversions, eg towlower_l(). Unless you also
implement built-in case mapping, you'd still have to call libc or ICU
for that, right? It seems a bit strange to use different systems for
classification and mapping. And if you do implement mapping too, you
have to decide whether you believe it is language-dependent or not, I
think?

Hmm, let's see what we're doing now... for ICU, the regex code is
using "simple" case mapping functions like u_toupper(c) that don't
take a locale, so no Turkish i/İ conversion for you, unlike our SQL
upper()/lower(), which this is supposed to agree with according to the
comments at the top. I see why: POSIX can only do one-by-one character
mappings (which cannot handle Greek's context-sensitive Σ->σ/ς or
German's multi-character ß->SS), while ICU offers only language-aware
"full" string conversion (which does not guarantee a 1:1 mapping for
each character in a string) OR non-language-aware "simple" character
conversion (which does not handle Turkish's i->İ). ICU has no middle
ground for language-aware mapping with only the 1:1 cases, probably
because that doesn't really make sense as a concept (as I assume Greek
speakers would agree). (The first sketch at the end of this message
shows the simple-vs-full difference concretely.)

> > > Not knowing anything about how glibc generates its charmaps,
> > > Unicode or pre-Unicode, I could take a wild guess that maybe in
> > > LATIN9 they have an old hand-crafted table, but for UTF-8 encoding
> > > it's fully outsourced to Unicode, and that's why you see a
> > > difference.
>
> No, the problem is that we're passing a pg_wchar to an ICU function
> that expects a 32-bit code point. Those two things are equivalent in
> the UTF8 encoding, but not in the LATIN9 encoding.

Ah right, I get that now (sorry, I confused myself by forgetting we
were talking about ICU).
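
To make the simple-vs-full distinction concrete, here's a throwaway
sketch against ICU's C API. u_toupper() and u_strToUpper() are the
real ICU functions; everything else (file name, build line, error
handling) is just illustration, not anything I'm proposing for our
tree:

/*
 * Sketch: ICU "simple" 1:1 character mapping vs "full" language-aware
 * string mapping. Error handling omitted for brevity. Build with
 * something like: cc demo.c $(pkg-config --cflags --libs icu-uc)
 */
#include <stdio.h>
#include <unicode/uchar.h>
#include <unicode/ustring.h>

int
main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar src[] = {0x0069, 0};      /* "i" */
    UChar eszett[] = {0x00DF, 0};   /* "ß" */
    UChar dst[8];
    int32_t len;

    /* Simple, locale-blind: 'i' -> 'I', never Turkish U+0130. */
    printf("u_toupper('i') = U+%04X\n", (unsigned) u_toupper(0x0069));

    /* Simple mapping of ß is the identity: there is no 1:1 answer. */
    printf("u_toupper(ß)   = U+%04X\n", (unsigned) u_toupper(0x00DF));

    /* Full, language-aware: "i" -> U+0130 (İ) under a Turkish locale. */
    len = u_strToUpper(dst, 8, src, -1, "tr", &status);
    printf("u_strToUpper(\"i\", \"tr\") = U+%04X (len %d)\n",
           (unsigned) dst[0], (int) len);

    /* Full mapping can grow the string: ß -> "SS" (len 2). */
    status = U_ZERO_ERROR;
    len = u_strToUpper(dst, 8, eszett, -1, "de", &status);
    printf("u_strToUpper(ß, \"de\") = U+%04X U+%04X (len %d)\n",
           (unsigned) dst[0], (unsigned) dst[1], (int) len);
    return 0;
}

If I have this right, the simple calls give plain 'I' and leave ß
alone, while the full calls give U+0130 and "SS" respectively, which
is exactly why no single ICU entry point fits POSIX's one-character-
at-a-time regex contract.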
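
And for the archives, a second tiny sketch of that pg_wchar confusion.
In a single-byte encoding the pg_wchar is just the byte value, and the
0xBD -> U+0153 mapping below is from the ISO-8859-15 table; again this
is only an illustration, not our code:

/* Sketch: a LATIN9 pg_wchar is not a Unicode code point. */
#include <stdio.h>
#include <unicode/uchar.h>

int
main(void)
{
    unsigned byte = 0xBD;   /* LATIN9: 'œ' (U+0153); Unicode: '½' */

    /*
     * Feeding the raw byte to ICU asks about U+00BD (½), which has no
     * uppercase mapping, so nothing happens...
     */
    printf("u_toupper(0xBD)   = U+%04X\n", (unsigned) u_toupper(byte));

    /* ...whereas converting to the real code point first gives Œ. */
    printf("u_toupper(U+0153) = U+%04X\n", (unsigned) u_toupper(0x0153));
    return 0;
}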