Re: encoding affects ICU regex character classification
On Fri, 2023-12-15 at 16:48 -0800, Jeremy Schneider wrote:
> This goes back to my other thread (which sadly got very little
> discussion): PostgreSQL really needs to be safe by /default/

Doesn't a built-in provider help create a safer option?

The built-in provider's version of Unicode will be consistent with
unicode_assigned(), which is a first step toward rejecting code points
that the provider doesn't understand. And by rejecting unassigned code
points, we get all kinds of Unicode compatibility guarantees that avoid
the kinds of change risks that you are worried about.

Regards,
	Jeff Davis
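As a minimal illustration of what rejecting unassigned code points
involves (a sketch only: it uses ICU's u_isdefined() as a stand-in for
the provider's own assignment check, and the chosen code points are
examples from elsewhere in this thread, not from Jeff's patch):

/* Whether a code point is "assigned" depends on the Unicode version
 * the checking library was built with.  ICU's u_isdefined() is a
 * stand-in here; a built-in provider would consult its own Unicode
 * tables instead.
 * Compile with: cc check.c $(pkg-config --cflags --libs icu-uc) */
#include <stdio.h>
#include <unicode/uchar.h>

int
main(void)
{
	/* 'A'; U+A7BA (assigned in Unicode 12.0); U+1E030 (Unicode 15.0) */
	UChar32		points[] = {0x0041, 0xA7BA, 0x1E030};

	for (int i = 0; i < 3; i++)
		printf("U+%04X assigned: %s\n", points[i],
			   u_isdefined(points[i]) ? "yes" : "no");
	return 0;
}

An old enough ICU reports "no" for the last two, which is exactly the
version skew under discussion.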
Re: encoding affects ICU regex character classification
On Sat, Dec 16, 2023 at 1:48 PM Jeremy Schneider wrote:
> On 12/14/23 7:12 AM, Jeff Davis wrote:
> > The concern over unassigned code points is misplaced. The application
> > may be aware of newly-assigned code points, and there's no way they
> > will be mapped correctly in Postgres if the provider is not aware of
> > those code points. The user can either proceed in using unassigned code
> > points and accept the risk of future changes, or wait for the provider
> > to be upgraded.
>
> This does not seem to me like a good way to view the situation.
>
> Earlier this summer, a day or two after writing a document, I was
> completely surprised to open it on my work computer and see "unknown
> character" boxes. When I had previously written the document on my home
> computer and when I had viewed it from my cell phone, everything was
> fine. Apple does a very good job of always keeping iPhones and macOS
> versions up-to-date with the latest versions of Unicode and the latest
> characters. iPhone keyboards make it very easy to access any character.
> Emojis are the canonical example here. My work computer was one major
> version of macOS behind my home computer.

That "SQUARE ERA NAME REIWA" code point we talked about in one of the
multi-version ICU threads was an interesting case study. It's not an
emoji; it entered real/serious use suddenly, landed in a quickly
wrapped minor release of Unicode, and then arrived in locale
definitions via regular package upgrades on various OSes, AFAICT
(i.e. it didn't require a major version upgrade of the OS).

https://en.wikipedia.org/wiki/Reiwa_era#Announcement
https://en.wikipedia.org/wiki/Reiwa_era#Technology
https://unicode.org/versions/Unicode12.1.0/
Re: encoding affects ICU regex character classification
On 12/14/23 7:12 AM, Jeff Davis wrote:
> The concern over unassigned code points is misplaced. The application
> may be aware of newly-assigned code points, and there's no way they
> will be mapped correctly in Postgres if the provider is not aware of
> those code points. The user can either proceed in using unassigned code
> points and accept the risk of future changes, or wait for the provider
> to be upgraded.

This does not seem to me like a good way to view the situation.

Earlier this summer, a day or two after writing a document, I was
completely surprised to open it on my work computer and see "unknown
character" boxes. When I had previously written the document on my home
computer and when I had viewed it from my cell phone, everything was
fine. Apple does a very good job of always keeping iPhones and macOS
versions up-to-date with the latest versions of Unicode and the latest
characters. iPhone keyboards make it very easy to access any character.
Emojis are the canonical example here. My work computer was one major
version of macOS behind my home computer.

And I'm probably one of the few people on this hackers email list who
even understands what the words "unassigned code point" mean.
Generally, the DBAs, sysadmins, architects and developers who are all
part of the tangled web of building and maintaining systems which use
PostgreSQL on their backend are never going to think about Unicode
characters proactively.

This goes back to my other thread (which sadly got very little
discussion): PostgreSQL really needs to be safe by /default/ ...
having GUCs is fine though; we can put an explanation in the docs
about what users should consider if they change a setting.

-Jeremy

--
http://about.me/jeremy_schneider
Re: encoding affects ICU regex character classification
On Tue, 2023-12-12 at 14:35 -0800, Jeremy Schneider wrote:
> Is someone able to test out upper & lower functions on U+A7BA ...
> U+A7BF
> across a few libs/versions?

Those code points are unassigned in Unicode 11.0 and assigned in
Unicode 12.0.

In ICU 63.2 (based on Unicode 11.0), they just get mapped to
themselves. In ICU 64.2 (based on Unicode 12.1) they get mapped the
same way the builtin CTYPE maps them (based on Unicode 15.1).

The concern over unassigned code points is misplaced. The application
may be aware of newly-assigned code points, and there's no way they
will be mapped correctly in Postgres if the provider is not aware of
those code points. The user can either proceed in using unassigned
code points and accept the risk of future changes, or wait for the
provider to be upgraded.

If the user doesn't have many expression indexes dependent on ctype
behavior, it doesn't matter much. If they do have such indexes, the
best we can offer is a controlled process, and the builtin provider
allows the most visibility and control.

(Aside: case mapping has very strong compatibility guarantees, but not
perfect. For better compatibility guarantees, we should support case
folding.)

> And I have no idea if or when
> glibc might have picked up the new Unicode characters.

That's a strong argument in favor of a builtin provider.

Regards,
	Jeff Davis
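For anyone who wants to reproduce this, a minimal sketch (assuming ICU
development headers are installed; running the same program against
different ICU versions should show the change described above):

/* Print ICU's simple case mappings and alphabetic classification for
 * U+A7BA..U+A7BF.  An ICU based on Unicode 11.0 maps them to
 * themselves; Unicode 12.0+ maps them pairwise to each other.
 * Compile with: cc a7ba.c $(pkg-config --cflags --libs icu-uc) */
#include <stdio.h>
#include <unicode/uchar.h>

int
main(void)
{
	for (UChar32 c = 0xA7BA; c <= 0xA7BF; c++)
		printf("U+%04X: upper=U+%04X lower=U+%04X alpha=%d\n",
			   c, u_toupper(c), u_tolower(c), (int) u_isalpha(c));
	return 0;
}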
Re: encoding affects ICU regex character classification
On 12/12/23 1:39 PM, Jeff Davis wrote:
> On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:
>> Unless you also
>> implement built-in case mapping, you'd still have to call libc or ICU
>> for that, right?
>
> We can do built-in case mapping, see:
>
> https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com
>
>> It seems a bit strange to use different systems for
>> classification and mapping. If you do implement mapping too, you
>> have
>> to decide if you believe it is language-dependent or not, I think?
>
> A complete solution would need to do the language-dependent case
> mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
> and only a handful of mapping changes, so we can handle that with the
> builtin provider as well.

This thread has me second-guessing the reply I just sent on the other
thread.

Is someone able to test out upper & lower functions on U+A7BA ...
U+A7BF across a few libs/versions? Theoretically the upper/lower
behavior should change in ICU between Ubuntu 18.04 LTS and Ubuntu
20.04 LTS (specifically in ICU 64 / Unicode 12). And I have no idea if
or when glibc might have picked up the new Unicode characters.

-Jeremy

--
http://about.me/jeremy_schneider
Re: encoding affects ICU regex character classification
On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote:
> How would you specify what you want?

One proposal would be to have a builtin collation provider:

https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.ca...@j-davis.com

I don't think there are very many ctype options, but they could be
specified as part of the locale, or perhaps even as some
provider-specific options specified at CREATE COLLATION time.

> As with collating, I like the
> idea of keeping support for libc even if it is terrible (some libcs
> more than others) and eventually not the default, because I think
> optional agreement with other software on the same host is a feature.

Of course we should keep the libc support around. I'm not sure how
relevant such a feature is, but I don't think we actually have to
remove it.

> Unless you also
> implement built-in case mapping, you'd still have to call libc or ICU
> for that, right?

We can do built-in case mapping, see:

https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com

> It seems a bit strange to use different systems for
> classification and mapping. If you do implement mapping too, you
> have
> to decide if you believe it is language-dependent or not, I think?

A complete solution would need to do the language-dependent case
mapping. But that seems to only be 3 locales ("az", "lt", and "tr"),
and only a handful of mapping changes, so we can handle that with the
builtin provider as well.

> Hmm, let's see what we're doing now... for ICU the regex code is
> using
> "simple" case mapping functions like u_toupper(c) that don't take a
> locale, so no Turkish i/İ conversion for you, unlike our SQL
> upper()/lower(), which this is supposed to agree with according to
> the
> comments at the top. I see why: POSIX can only do one-by-one
> character mappings (which cannot handle Greek's context-sensitive
> Σ->σ/ς or German's multi-character ß->SS)

Regexes are inherently character-by-character, so transformations like
ß->SS are not going to work for case-insensitive regex matching
regardless of the provider.

Σ->σ/ς does make sense, and what we have seems to be just broken:

  select 'ς' ~* 'Σ'; -- false in both libc and ICU
  select 'Σ' ~* 'ς'; -- true in both libc and ICU

Similarly for titlecase variants:

  select 'Dž' ~* 'dž'; -- false in libc and ICU
  select 'dž' ~* 'Dž'; -- true in libc and ICU

If we do the case mapping ourselves, we can make those work. We'd just
have to modify the APIs a bit so that allcases() can actually get all
of the case variants, rather than relying on just towupper/towlower.

Regards,
	Jeff Davis
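To make the asymmetry concrete, here is a small sketch of which
variants are reachable from each sigma under one-way simple mappings
(it uses ICU's u_toupper()/u_tolower() for illustration; it is not the
actual allcases() code):

/* From ς you can reach Σ, but from Σ the only lowercase reachable is
 * σ, so ς is invisible when the pattern character is Σ -- matching
 * the regex results above.
 * Compile with: cc sigma.c $(pkg-config --cflags --libs icu-uc) */
#include <stdio.h>
#include <unicode/uchar.h>

static void
reachable(UChar32 c)
{
	printf("U+%04X: toupper=U+%04X tolower=U+%04X\n",
		   c, u_toupper(c), u_tolower(c));
}

int
main(void)
{
	reachable(0x03C2);			/* ς final sigma: reaches Σ (and itself) */
	reachable(0x03A3);			/* Σ capital sigma: reaches σ, never ς */
	reachable(0x03C3);			/* σ small sigma: reaches Σ (and itself) */
	return 0;
}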
Re: encoding affects ICU regex character classification
On Sat, Dec 2, 2023 at 9:49 AM Jeff Davis wrote:
> Your definition is too wide in my opinion, because it mixes together
> different sources of variation that are best left separate:
> a. region/language
> b. technical requirements
> c. versioning
> d. implementation variance
>
> (a) is not a true source of variation (please correct me if I'm wrong)
>
> (b) is perhaps interesting. The "C" locale is one example, and perhaps
> there are others, but I doubt very many others that we want to support.
>
> (c) is not a major concern in my opinion. The impact of Unicode changes
> is usually not dramatic, and it only affects regexes so it's much more
> contained than collation, for example. And if you really care, just use
> the "C" locale.
>
> (d) is mostly a bug

I get you. I was mainly commenting on what POSIX APIs allow, which is
much wider than what you might observe on $DISTRO, and also
end-user-customisable. But I agree that Unicode is all-pervasive and
authoritative in practice, to the point that if your libc disagrees
with it, it's probably just wrong. (I guess site-local locales were
essential for bootstrapping in the early days of computers in a
language/territory, but I can't find much discussion of the tools
being used by non-libc-maintainers today.)

> I think we only need 2 main character classification schemes: "C" and
> Unicode (TR #18 Compatibility Properties[1], either the "Standard"
> variant or the "POSIX Compatible" variant or both). The libc and ICU
> ones should be there only for compatibility and discouraged and
> hopefully eventually removed.

How would you specify what you want?

As with collating, I like the idea of keeping support for libc even if
it is terrible (some libcs more than others) and eventually not the
default, because I think optional agreement with other software on the
same host is a feature.

In the regex code we see not only class membership tests,
e.g. iswlower_l(), but also conversions, e.g. towlower_l(). Unless you
also implement built-in case mapping, you'd still have to call libc or
ICU for that, right? It seems a bit strange to use different systems
for classification and mapping. If you do implement mapping too, you
have to decide if you believe it is language-dependent or not, I
think?

Hmm, let's see what we're doing now... for ICU the regex code is using
"simple" case mapping functions like u_toupper(c) that don't take a
locale, so no Turkish i/İ conversion for you, unlike our SQL
upper()/lower(), which this is supposed to agree with according to the
comments at the top. I see why: POSIX can only do one-by-one character
mappings (which cannot handle Greek's context-sensitive Σ->σ/ς or
German's multi-character ß->SS), while ICU offers only language-aware
"full" string conversion (which does not guarantee a 1:1 mapping for
each character in a string) OR non-language-aware "simple" character
conversion (which does not handle Turkish's i->İ). ICU has no middle
ground for language-aware mapping with just the 1:1 cases, probably
because that doesn't really make total sense as a concept (as I assume
Greek speakers would agree).

>>> Not knowing anything about how glibc generates its charmaps,
>>> Unicode
>>> or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
>>> have an old hand-crafted table, but for UTF-8 encoding it's fully
>>> outsourced to Unicode, and that's why you see a difference.
>
> No, the problem is that we're passing a pg_wchar to an ICU function
> that expects a 32-bit code point. Those two things are equivalent in
> the UTF8 encoding, but not in the LATIN9 encoding.

Ah right, I get that now (sorry, I confused myself by forgetting we
were talking about ICU).
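The simple-versus-full distinction is easy to see side by side; a
short sketch (assuming ICU headers, with error handling kept minimal):

/* ICU's non-language-aware "simple" per-character mapping versus its
 * language-aware "full" string mapping: only the latter produces
 * Turkish İ (U+0130) for "i".
 * Compile with: cc casemap.c $(pkg-config --cflags --libs icu-uc) */
#include <stdio.h>
#include <unicode/uchar.h>
#include <unicode/ustring.h>

int
main(void)
{
	UChar		src[] = {0x0069, 0};	/* "i" */
	UChar		dst[8];
	UErrorCode	status = U_ZERO_ERROR;

	/* simple: 1:1 and no locale, so 'i' just becomes ASCII 'I' */
	printf("simple: u_toupper('i') = U+%04X\n", u_toupper(0x0069));

	/* full: locale-aware, and not guaranteed 1:1 */
	u_strToUpper(dst, 8, src, -1, "tr", &status);
	if (U_SUCCESS(status))
		printf("full:   u_strToUpper(\"i\", \"tr\") = U+%04X\n",
			   (UChar32) dst[0]);
	return 0;
}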
Re: encoding affects ICU regex character classification
On Thu, Nov 30, 2023 at 1:23 PM Jeff Davis wrote:
> Character classification is not localized at all in libc or ICU as far
> as I can tell.

Really? POSIX isalpha()/isalpha_l() and friends clearly depend on a
locale. See e.g. d522b05c for a case where that broke something.
Perhaps you mean glibc wouldn't do that to you, because you know that,
as an unstandardised detail, it sucks in (some version of) Unicode's
data, which shouldn't vary between locales. But you are allowed to
make your own locales, including putting whatever classifications you
want into the LC_CTYPE file using POSIX-standardised tools like
localedef. Perhaps that is a bit of a stretch, and no one really does
that in practice, but anyway it's still "localized".

Not knowing anything about how glibc generates its charmaps, Unicode
or pre-Unicode, I could take a wild guess that maybe in LATIN9 they
have an old hand-crafted table, but for UTF-8 encoding it's fully
outsourced to Unicode, and that's why you see a difference.

Another problem seen in a few parts of our tree is that we sometimes
feed individual UTF-8 bytes to the isXXX() functions, which is about
as well defined as trying to pay for a pint with the left half of a
$10 bill.

As for ICU, it's "not localized" only if there is only one ICU library
in the universe, but of course different versions of ICU might give
different answers because they correspond to different versions of
Unicode (as do glibc versions, FreeBSD libc versions, etc.) and also
might disagree with tables built by PostgreSQL. Maybe irrelevant for
now, but I think with thus-far-imagined variants of the multi-version
ICU proposal, you have to choose whether to call u_isUAlphabetic() in
the library we're linked against, or via the dlsym() we look up in a
particular dlopen'd library. So I guess we'd have to access it via our
pg_locale_t, so again it'd be "localized" by some definitions.

Thinking about how to apply that thinking to libc... this is going to
sound far-fetched and handwavy, but here goes: we could even imagine a
multi-version system based on different base locale paths. Instead of
using the system-provided locales under /usr/share/locale to look in
when we call newlocale(..., "en_NZ.UTF-8", ...), POSIX says we're
allowed to specify an absolute path, e.g. newlocale(...,
"/foo/bar/unicode11/en_NZ.UTF-8", ...). If it is possible to use
$DISTRO's localedef to compile $OLD_DISTRO's locale sources to get
historical behaviour, that might provide a way to get them without
assuming the binary format is stable (it definitely isn't, but the
source format is nailed down by POSIX). One fly in the ointment is
that glibc failed to implement absolute path support, so you might
need to use versioned locale names instead, or see if the LOCPATH
environment variable can be swizzled around without confusing glibc's
locale cache. Then it wouldn't be fundamentally different from the
hypothesised multi-version ICU case: you could probably come up with
different isalpha_l() results for different locales because you have
different LC_CTYPE versions (for example, Unicode 15.0 added new
extended Cyrillic characters 1E030..1E08F; they look alphabetical to
me, but what would I know). That is an extremely hypothetical
pie-in-the-sky thought and I don't know if it'd really work very well,
but it is a concrete way that someone might finish up getting
different answers out of isalpha_l(), to observe that it really is
localised. And localized.
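To make the hand-waving slightly more concrete, a hypothetical sketch
of the absolute-path idea (the locale path is made up; glibc would
fail the newlocale() call, as noted; and passing 0x1E030 as a wide
character assumes wchar_t values are Unicode code points, which POSIX
does not promise):

/* Load a pinned LC_CTYPE by absolute path and ask it about one of the
 * Unicode 15.0 Cyrillic additions mentioned above. */
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int
main(void)
{
	locale_t	loc = newlocale(LC_CTYPE_MASK,
								"/opt/locales/unicode11/en_NZ.UTF-8",	/* hypothetical */
								(locale_t) 0);

	if (loc == (locale_t) 0)
	{
		perror("newlocale");	/* e.g. glibc rejecting the absolute path */
		return 1;
	}
	printf("iswalpha_l(U+1E030) = %d\n", iswalpha_l(0x1E030, loc));
	freelocale(loc);
	return 0;
}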
Re: encoding affects ICU regex character classification
Jeff Davis writes:
> The problem seems to be confusion between pg_wchar and a unicode code
> point in pg_wc_isalpha() and related functions.

Yeah, that's an ancient sore spot: we don't really know what the
representation of wchar is. We assume it's Unicode code points for
UTF8 locales, but libc isn't required to do that AFAIK. See comment
block starting about line 20 in regc_pg_locale.c.

I doubt that ICU has much to do with this directly. We'd have to find
an alternate source of knowledge to replace the functions if we wanted
to fix it fully ... can ICU do that?

			regards, tom lane
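As a footnote on that sore spot, a small sketch of the assumption in
question (__STDC_ISO_10646__ is about the only portable promise that
wchar_t values are Unicode code points; the locale name is assumed to
be installed):

/* Decode one UTF-8 character with libc and see whether the resulting
 * wchar_t equals the Unicode code point. */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	wchar_t		wc;

	if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
		return 1;
#ifdef __STDC_ISO_10646__
	puts("wchar_t values are Unicode code points on this platform");
#else
	puts("no guarantee that wchar_t values are Unicode code points");
#endif
	/* U+00E9 LATIN SMALL LETTER E WITH ACUTE, encoded as UTF-8 */
	if (mbtowc(&wc, "\xc3\xa9", 2) > 0)
		printf("mbtowc -> 0x%lX (0xE9 only if wchar_t is the code point)\n",
			   (unsigned long) wc);
	return 0;
}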