Thanks, I do know about the docs you quoted. Thanks for pointing me
to the comment in the code.
I've posted an issue (a request to make the documentation match the
code) at the TRE repository:
https://github.com/laurikari/tre/issues/88
On 2023-06-01 5:53 a.m., Tomas Kalibera wrote:
On 5/30/23 17:45, Ben Bolker wrote:
Inspired by this old Stack Overflow question
https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions
I was wondering why this is TRUE:
Sys.setlocale("LC_ALL", "et_EE")
grepl("[A-Z]", "T")
TRE's documentation at
<https://laurikari.net/tre/documentation/regex-syntax/> says that a
range "is shorthand for the full range of characters between those two
[endpoints] (inclusive) in the collating sequence".
Yet, T is *not* between A and Z in the Estonian collating sequence:
sort(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "Z" "T" "U" "V" "W" "X" "Y"
I realize that this may be a question about TRE rather than about R
*per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so
the question also applies to PCRE), but I'm wondering if anyone has
any insights ... (and yes, I know that the correct answer is "use
[:alpha:] and don't worry about it")
The correct answer depends on what you want to do, but please see
?regexp in R:
"Because their interpretation is locale- and implementation-dependent,
character ranges are best avoided."
and
"The only portable way to specify all ASCII letters is to list them all
as the character class
‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’."
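As a small illustration of the ?regexp advice quoted above, the explicit class can be built from R's built-in LETTERS and letters constants rather than typed by hand. This is only a sketch; the name ascii_letters is made up for the example.

```r
# Build the explicit ASCII-letter class from the built-in constants.
# LETTERS and letters are fixed vectors, so this does not depend on the locale.
ascii_letters <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")

# Match against an upper-case letter, a lower-case letter and a digit.
res <- grepl(ascii_letters, c("T", "t", "5"))
print(res)
```

Because every letter is listed explicitly, no range expression is involved and the collating sequence does not matter.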
This is from the POSIX specification:
"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched.
A range expression shall be expressed as the starting point and the
ending point separated by a <hyphen-minus> ( '-' )."
If you really want to know why the current implementations in R, TRE and
PCRE2 work the way they do, you can check the code, but I don't think
it would be a good use of time given what is written above.
It may be that TRE has a bug and doesn't do what was intended (see the
comment "XXX - Should use collation order instead of encoding values in
character ranges." in the code), but I didn't check the code thoroughly.
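For what it's worth, the TRUE result is what one would expect if range endpoints were compared by encoding value rather than by collation order, as that comment suggests. A quick check of the code points (just an illustration, not a statement about what TRE actually does internally; the variable names are made up):

```r
# Code points of the range endpoints and of the tested character.
lo <- utf8ToInt("A")  # 65
hi <- utf8ToInt("Z")  # 90
x  <- utf8ToInt("T")  # 84

# "T" lies between "A" and "Z" when compared by code point,
# whatever the collating sequence of the current locale says.
in_range <- lo <= x && x <= hi
print(in_range)
```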
Best
Tomas
(In contrast, the ICU engine underlying stringi/stringr says "[t]he
characters to include are determined by Unicode code point ordering" -
see
https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
for links)
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel