On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:
> How would you specify what you want?

One proposal would be to have a builtin collation provider:

https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.ca...@j-davis.com

I don't think there are very many ctype options, but they could be
specified as part of the locale, or perhaps even as some
provider-specific options specified at CREATE COLLATION time.

> As with collating, I like the
> idea of keeping support for libc even if it is terrible (some libcs
> more than others) and eventually not the default, because I think
> optional agreement with other software on the same host is a feature.

Of course we should keep the libc support around. I'm not sure how
relevant such a feature is, but I don't think we actually have to
remove it.

> Unless you also
> implement built-in case mapping, you'd still have to call libc or ICU
> for that, right?

We can do built-in case mapping, see:

https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com

> It seems a bit strange to use different systems for
> classification and mapping. If you do implement mapping too, you have
> to decide if you believe it is language-dependent or not, I think?

A complete solution would need to do the language-dependent case
mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
and only a handful of mapping changes, so we can handle that with the
builtin provider as well.

> Hmm, let's see what we're doing now... for ICU the regex code is using
> "simple" case mapping functions like u_toupper(c) that don't take a
> locale, so no Turkish i/İ conversion for you, unlike our SQL
> upper()/lower(), which this is supposed to agree with according to the
> comments at the top. I see why: POSIX can only do one-by-one
> character mappings (which cannot handle Greek's context-sensitive
> Σ->σ/ς or German's multi-character ß->SS)

Regexes are inherently character-by-character, so transformations like
ß->SS are not going to work for case-insensitive regex matching
regardless of the provider.

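To illustrate the character-by-character limitation, here is a small
Python sketch. It is not PostgreSQL code: Python's built-in Unicode case
mappings stand in for a case-mapping table, and the helper name
has_simple_case_mapping is made up for this example. It checks whether a
character's upper and lower mappings are each a single code point, which
is roughly what a one-character-at-a-time matcher would need.

```python
# Illustration only: Python's str.upper()/str.lower() stand in for a
# built-in case-mapping table. has_simple_case_mapping is a hypothetical
# helper, not a PostgreSQL or ICU function.

def has_simple_case_mapping(ch: str) -> bool:
    """Return True if upper- and lowercasing ch each yield one code point."""
    if len(ch) != 1:
        raise ValueError("expected a single code point")
    return len(ch.upper()) == 1 and len(ch.lower()) == 1


# Print each character, the check result, and its upper/lower forms.
for ch in ("\u00df", "\u03a3", "\u03c3", "\u03c2"):  # ß, Σ, σ, ς
    print(ch, has_simple_case_mapping(ch), repr(ch.upper()), repr(ch.lower()))
```

In this sketch ß fails the check because its uppercase form is the
two-character string SS, while Σ, σ, and ς each map to single
characters.
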
Σ->σ/ς does make sense, and what we have seems to be just broken:

  select 'ς' ~* 'Σ'; -- false in both libc and ICU
  select 'Σ' ~* 'ς'; -- true in both libc and ICU

Similarly for titlecase variants:

  select 'Dž' ~* 'dž'; -- false in libc and ICU
  select 'dž' ~* 'Dž'; -- true in libc and ICU

If we do the case mapping ourselves, we can make those work. We'd just
have to modify the APIs a bit so that allcases() can actually get all
of the case variants, rather than relying on just towupper/towlower.

Regards,
	Jeff Davis