Re: encoding affects ICU regex character classification

Jeff Davis Tue, 12 Dec 2023 13:40:22 -0800

On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:

> How would you specify what you want?


One proposal would be to have a builtin collation provider:

https://postgr.es/m/[email protected]

I don't think there are very many ctype options, but they could be
specified as part of the locale, or perhaps even as provider-specific
options given at CREATE COLLATION time.

> As with collating, I like the
> idea of keeping support for libc even if it is terrible (some libcs
> more than others) and eventually not the default, because I think
> optional agreement with other software on the same host is a feature.

Of course we should keep the libc support around. I'm not sure how
relevant such a feature is, but I don't think we actually have to
remove it.

> Unless you also
> implement built-in case mapping, you'd still have to call libc or ICU
> for that, right?

We can do built-in case mapping, see:

https://postgr.es/m/[email protected]

> It seems a bit strange to use different systems for classification
> and mapping.  If you do implement mapping too, you have to decide if
> you believe it is language-dependent or not, I think?

A complete solution would need to do the language-dependent case
mapping. But that seems to affect only three languages ("az", "lt",
and "tr"), and only a handful of mapping changes, so we can handle
that with the builtin provider as well.
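
To give a sense of how small the language-specific part is, here is a
rough Python sketch (not PostgreSQL code) of the Turkish/Azerbaijani
dotted and dotless i overrides layered on top of a default mapping. The
names lower_char and upper_char and the lang parameter are made up for
illustration. The Lithuanian rules and the context-sensitive rule for
"I" followed by a combining dot above are left out.

```python
# Hypothetical sketch: language-specific overrides on top of the default
# case mapping. Simplified: only the single-character "tr"/"az" rules for
# dotted/dotless i are covered; "lt" and context-sensitive rules are omitted.
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}   # I -> ı, İ -> i
TURKIC_UPPER = {"i": "\u0130", "\u0131": "I"}   # i -> İ, ı -> I


def lower_char(ch: str, lang: str = "und") -> str:
    """Lowercase one character, applying tr/az overrides when requested."""
    if lang in ("tr", "az") and ch in TURKIC_LOWER:
        return TURKIC_LOWER[ch]
    return ch.lower()


def upper_char(ch: str, lang: str = "und") -> str:
    """Uppercase one character, applying tr/az overrides when requested."""
    if lang in ("tr", "az") and ch in TURKIC_UPPER:
        return TURKIC_UPPER[ch]
    return ch.upper()
```

With these, lower_char("I", "tr") gives the dotless ı while
lower_char("I") still gives a plain "i".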

> Hmm, let's see what we're doing now... for ICU the regex code is using
> "simple" case mapping functions like u_toupper(c) that don't take a
> locale, so no Turkish i/İ conversion for you, unlike our SQL
> upper()/lower(), which this is supposed to agree with according to the
> comments at the top.  I see why: POSIX can only do one-by-one
> character mappings (which cannot handle Greek's context-sensitive
> Σ->σ/ς or German's multi-character ß->SS)

Regexes are inherently character-by-character, so transformations like
ß->SS are not going to work for case-insensitive regex matching
regardless of the provider.
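
A quick way to see the one-to-many problem, sketched in Python rather
than in the regex code: the full uppercase mapping of ß is two
characters, so there is no single character to use in a
character-by-character comparison. The helper name one_to_one_upper is
hypothetical.

```python
def one_to_one_upper(ch: str):
    """Return the uppercase form of ch if it is a single character, else None."""
    up = ch.upper()  # full case mapping; may expand to several characters
    return up if len(up) == 1 else None
```

Here one_to_one_upper("ß") returns None, because "ß".upper() is the
two-character string "SS".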

Σ->σ/ς does make sense, and what we have seems to be just broken:

  select 'ς' ~* 'Σ'; -- false in both libc and ICU
  select 'Σ' ~* 'ς'; -- true in both libc and ICU

Similarly for titlecase variants:

  select 'ǅ' ~* 'ǆ'; -- false in both libc and ICU
  select 'ǆ' ~* 'ǅ'; -- true in both libc and ICU

If we do the case mapping ourselves, we can make those work. We'd just
have to modify the APIs a bit so that allcases() can actually get all
of the case variants, rather than relying on just towupper/towlower.
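
A rough sketch of the idea in Python (not the actual allcases() code):
instead of asking only for the upper and lower form of a character,
group characters by their case-folded form and return the whole group.
The names all_cases and _build_fold_groups are hypothetical, and the
sketch leans on Python's str.casefold(), which applies full case
folding, so a character whose folding is longer than one character is
simply returned alone.

```python
import sys
from collections import defaultdict
from functools import lru_cache


@lru_cache(maxsize=None)
def _build_fold_groups():
    """Map each single-character case folding to every character that folds to it."""
    groups = defaultdict(set)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        key = ch.casefold()
        if len(key) == 1:          # keep only one-to-one foldings
            groups[key].add(ch)
    return groups


def all_cases(ch: str) -> set:
    """Return every character that is a one-to-one case variant of ch."""
    key = ch.casefold()
    if len(key) != 1:              # e.g. a folding that expands to two characters
        return {ch}
    return set(_build_fold_groups().get(key, {ch}))
```

With that, all_cases("Σ") and all_cases("ς") return the same set, which
also contains σ, so a match built from it would work in both directions.
The same holds for the ǅ/ǆ pair.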


Regards,
        Jeff Davis
