On Sat, Dec 2, 2023 at 9:49 AM Jeff Davis <pg...@j-davis.com> wrote:
> Your definition is too wide in my opinion, because it mixes together
> different sources of variation that are best left separate:
> a. region/language
> b. technical requirements
> c. versioning
> d. implementation variance
>
> (a) is not a true source of variation (please correct me if I'm wrong)
>
> (b) is perhaps interesting. The "C" locale is one example, and perhaps
> there are others, but I doubt very many others that we want to support.
>
> (c) is not a major concern in my opinion. The impact of Unicode changes
> is usually not dramatic, and it only affects regexes so it's much more
> contained than collation, for example. And if you really care, just use
> the "C" locale.
>
> (d) is mostly a bug
I get you. I was mainly commenting on what the POSIX APIs allow, which
is much wider than what you might observe on <your local libc>, and
also end-user-customisable. But I agree that Unicode is all-pervasive
and authoritative in practice, to the point that if your libc disagrees
with it, it's probably just wrong. (I guess site-local locales were
essential for bootstrapping in the early days of computing in a new
language/territory, but I can't find much discussion of those tools
being used by anyone other than libc maintainers today.)

> I think we only need 2 main character classification schemes: "C" and
> Unicode (TR #18 Compatibility Properties[1], either the "Standard"
> variant or the "POSIX Compatible" variant or both). The libc and ICU
> ones should be there only for compatibility and discouraged and
> hopefully eventually removed.

How would you specify what you want? As with collating, I like the
idea of keeping support for libc even if it is terrible (some libcs
more than others) and eventually not the default, because I think
optional agreement with other software on the same host is a feature.

In the regex code we see not only class membership tests, eg
iswlower_l(), but also conversions, eg towlower_l(). Unless you also
implement built-in case mapping, you'd still have to call libc or ICU
for that, right? It seems a bit strange to use different systems for
classification and mapping. And if you do implement mapping too, you
have to decide whether you believe it is language-dependent or not, I
think?

Hmm, let's see what we're doing now... for ICU, the regex code is
using "simple" case mapping functions like u_toupper(c) that don't
take a locale, so no Turkish i/İ conversion for you, unlike our SQL
upper()/lower(), which this is supposed to agree with according to the
comments at the top. I see why: POSIX can only do one-by-one character
mappings (which cannot handle Greek's context-sensitive Σ->σ/ς or
German's multi-character ß->SS), while ICU offers only language-aware
"full" string conversion (which does not guarantee a 1:1 mapping for
each character in a string) OR non-language-aware "simple" character
conversion (which does not handle Turkish's i->İ). ICU has no middle
ground for language-aware mapping with only the 1:1 cases, probably
because that doesn't really make sense as a concept (as I assume Greek
speakers would agree). (The first sketch at the end of this message
shows the simple-vs-full difference concretely.)

> > > Not knowing anything about how glibc generates its charmaps,
> > > Unicode or pre-Unicode, I could take a wild guess that maybe in
> > > LATIN9 they have an old hand-crafted table, but for UTF-8 encoding
> > > it's fully outsourced to Unicode, and that's why you see a
> > > difference.
>
> No, the problem is that we're passing a pg_wchar to an ICU function
> that expects a 32-bit code point. Those two things are equivalent in
> the UTF8 encoding, but not in the LATIN9 encoding.

Ah right, I get that now (sorry, I confused myself by forgetting we
were talking about ICU).
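
To make the simple-vs-full distinction concrete, here's a throwaway
sketch against ICU's C API. u_toupper() and u_strToUpper() are the
real ICU functions; everything else (file name, build line, error
handling) is just illustration, not anything I'm proposing for our
tree:

/*
 * Sketch: ICU "simple" 1:1 character mapping vs "full" language-aware
 * string mapping. Error handling omitted for brevity. Build with
 * something like: cc demo.c $(pkg-config --cflags --libs icu-uc)
 */
#include <stdio.h>
#include <unicode/uchar.h>
#include <unicode/ustring.h>

int
main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar src[] = {0x0069, 0};      /* "i" */
    UChar eszett[] = {0x00DF, 0};   /* "ß" */
    UChar dst[8];
    int32_t len;

    /* Simple, locale-blind: 'i' -> 'I', never Turkish U+0130. */
    printf("u_toupper('i') = U+%04X\n", (unsigned) u_toupper(0x0069));

    /* Simple mapping of ß is the identity: there is no 1:1 answer. */
    printf("u_toupper(ß)   = U+%04X\n", (unsigned) u_toupper(0x00DF));

    /* Full, language-aware: "i" -> U+0130 (İ) under a Turkish locale. */
    len = u_strToUpper(dst, 8, src, -1, "tr", &status);
    printf("u_strToUpper(\"i\", \"tr\") = U+%04X (len %d)\n",
           (unsigned) dst[0], (int) len);

    /* Full mapping can grow the string: ß -> "SS" (len 2). */
    status = U_ZERO_ERROR;
    len = u_strToUpper(dst, 8, eszett, -1, "de", &status);
    printf("u_strToUpper(ß, \"de\") = U+%04X U+%04X (len %d)\n",
           (unsigned) dst[0], (unsigned) dst[1], (int) len);
    return 0;
}

If I have this right, the simple calls give plain 'I' and leave ß
alone, while the full calls give U+0130 and "SS" respectively, which
is exactly why no single ICU entry point fits POSIX's one-character-
at-a-time regex contract.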
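
And for the archives, a second tiny sketch of that pg_wchar confusion.
In a single-byte encoding the pg_wchar is just the byte value, and the
0xBD -> U+0153 mapping below is from the ISO-8859-15 table; again this
is only an illustration, not our code:

/* Sketch: a LATIN9 pg_wchar is not a Unicode code point. */
#include <stdio.h>
#include <unicode/uchar.h>

int
main(void)
{
    unsigned byte = 0xBD;   /* LATIN9: 'œ' (U+0153); Unicode: '½' */

    /*
     * Feeding the raw byte to ICU asks about U+00BD (½), which has no
     * uppercase mapping, so nothing happens...
     */
    printf("u_toupper(0xBD)   = U+%04X\n", (unsigned) u_toupper(byte));

    /* ...whereas converting to the real code point first gives Œ. */
    printf("u_toupper(U+0153) = U+%04X\n", (unsigned) u_toupper(0x0153));
    return 0;
}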