Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

Ben Bolker Sat, 03 Jun 2023 08:34:43 -0700

Thanks, I do know about the docs you quoted. Thanks for pointing me to the comment in the code.

I've posted an issue (a request to make the documentation match the code) at the TRE repository:


https://github.com/laurikari/tre/issues/88


On 2023-06-01 5:53 a.m., Tomas Kalibera wrote:

On 5/30/23 17:45, Ben Bolker wrote:
Inspired by this old Stack Overflow question

https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

I was wondering why this is TRUE:

Sys.setlocale("LC_ALL", "et_EE")
grepl("[A-Z]", "T")
TRE's documentation at <https://laurikari.net/tre/documentation/regex-syntax/> says that a range "is shorthand for the full range of characters between those two [endpoints] (inclusive) in the collating sequence".
Yet, T is *not* between A and Z in the Estonian collating sequence:

 sort(LETTERS)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "Z" "T" "U" "V" "W" "X" "Y"
I realize that this may be a question about TRE rather than about R *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so the question also applies to PCRE), but I'm wondering if anyone has any insights ... (and yes, I know that the correct answer is "use [:alpha:] and don't worry about it")
The correct answer depends on what you want to do, but please see ?regexp in R:
"Because their interpretation is locale- and implementation-dependent, character ranges are best avoided."
and
"The only portable way to specify all ASCII letters is to list them all as the character class
‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’."
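
As a minimal sketch of the portable approach quoted from ?regexp, the uppercase half of that class can be spelled out explicitly instead of written as a range. The helper name has_ascii_upper is hypothetical, used only for illustration; only base R functions are assumed.

```r
# Explicit list of the 26 uppercase ASCII letters, with no range involved.
ascii_upper <- "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"

# The same pattern built from the base R constant LETTERS.
ascii_upper_built <- paste0("[", paste(LETTERS, collapse = ""), "]")

# Hypothetical helper: TRUE where a string contains an uppercase ASCII letter.
has_ascii_upper <- function(x) grepl(ascii_upper, x)

has_ascii_upper(c("T", "t", "Z", "1"))
```

Because every member of the class is listed, the result should not depend on how the regex engine or the locale interprets a range such as [A-Z].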

This is from the POSIX specification:
"In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the starting point and the ending point separated by a <hyphen-minus> ( '-' )."
If you really want to know why the current implementation of R, TRE and PCRE2 works in a certain way, you can check the code, but I don't think it would be a good use of the time given what is written above.
It may be that TRE has a bug, maybe it doesn't do what was intended (see comment "XXX - Should use collation order instead of encoding values in character ranges." in the code), but I didn't check the code thoroughly.
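
To illustrate what comparing "encoding values" rather than collation order would mean, here is a rough sketch in base R. It is not TRE's actual code; the helper in_codepoint_range is hypothetical and simply compares Unicode code points, which is one plausible reading of the comment quoted above.

```r
# Hypothetical helper: is the single character `ch` between `lo` and `hi`
# when compared by Unicode code point (not by locale collation order)?
in_codepoint_range <- function(ch, lo, hi) {
  cp <- utf8ToInt(ch)
  length(cp) == 1L && cp >= utf8ToInt(lo) && cp <= utf8ToInt(hi)
}

in_codepoint_range("T", "A", "Z")  # code point 84 lies between 65 and 90
in_codepoint_range("a", "A", "Z")  # code point 97 lies above 90
```

Under this interpretation "T" falls inside A-Z regardless of where the locale's collating sequence places it, which would be consistent with the grepl() result reported at the start of the thread.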
Best
Tomas
(In contrast, the ICU engine underlying stringi/stringr says "[t]he characters to include are determined by Unicode code point ordering" - see
https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

for links)

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

