Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

2023-06-01 Thread Martin Maechler
> Ben Bolker 
> on Tue, 30 May 2023 11:45:20 -0400 writes:

> Inspired by this old Stack Overflow question

> https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

> I was wondering why this is TRUE:

> Sys.setlocale("LC_ALL", "et_EE")
> grepl("[A-Z]", "T")

> TRE's documentation at 
>  says that a 
> range "is shorthand for the full range of characters between those two 
> [endpoints] (inclusive) in the collating sequence".

> Yet, T is *not* between A and Z in the Estonian collating sequence:

> sort(LETTERS)
> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> [20] "Z" "T" "U" "V" "W" "X" "Y"

> I realize that this may be a question about TRE rather than about R 
> *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so 
> the question also applies to PCRE), but I'm wondering if anyone has any 
> insights ...  (and yes, I know that the correct answer is "use [:alpha:] 
> and don't worry about it")

> (In contrast, the ICU engine underlying stringi/stringr says "[t]he 
> characters to include are determined by Unicode code point ordering" - see

> https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

> for links)

Your last parenthetical remark may point to the solution of the riddle:
Nowadays, typically in R

> capabilities()[["ICU"]]
[1] TRUE

but of course one now has to work out whether, and why, ICU seems to
take precedence over the locale's own "sort"ing ..
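
A minimal sketch of how the two orderings might be told apart in base R.
It assumes, per ?sort, that method = "radix" always collates as in the C
locale, whereas the default method for character vectors follows the
collation of the current locale. The vector `x` is only an illustrative
input, and the locale-dependent result is deliberately not shown.

```r
# Illustrative input: four of the letters from the example above.
x <- c("Z", "T", "S", "A")

# method = "radix" is documented to collate as in the C locale, so for
# ASCII letters this is plain code point order.
by_codepoint <- sort(x, method = "radix")

# The default method uses the collation of the current locale, so its
# result can differ between, say, en_US and et_EE.
by_locale <- sort(x)

print(by_codepoint)
print(by_locale)
print(capabilities("ICU"))  # whether this R build has ICU available
```

Comparing `by_codepoint` with `by_locale` under et_EE would show whether
sort() and the regex range disagree about where "Z" goes.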


Best regards,
Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

2023-06-01 Thread Tomas Kalibera




The correct answer depends on what you want to do, but please see 
?regexp in R:


"Because their interpretation is locale- and implementation-dependent, 
character ranges are best avoided."


and

"The only portable way to specify all ASCII letters is to list them all 
as the character class

‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’."

This is from the POSIX specification:

"In the POSIX locale, a range expression represents the set of collating 
elements that fall between two elements in the collation sequence, 
inclusive. In other locales, a range expression has unspecified 
behavior: strictly conforming applications shall not rely on whether the 
range expression is valid, or on the set of collating elements matched. 
A range expression shall be expressed as the starting point and the 
ending point separated by a <hyphen-minus> ( '-' )."
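
For what it's worth, the explicit class recommended by ?regexp need not
be typed by hand. A minimal sketch using only the base constants LETTERS
and letters; the name `ascii_letters` is made up for illustration:

```r
# Build "[ABC...XYZabc...xyz]" from the built-in constants, so that no
# range expression (and hence no locale dependence) is involved.
ascii_letters <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")

# Matches ASCII letters only; the digit is rejected.
grepl(ascii_letters, c("T", "t", "5"))
```

Because every member of the class is listed explicitly, the result should
not depend on the collating sequence of the current locale.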


If you really want to know why the current implementations in R, TRE and 
PCRE2 behave the way they do, you can check the code, but I don't think 
that would be a good use of time given what is written above.


It may be that TRE has a bug, in that it doesn't do what was intended (see 
the comment "XXX - Should use collation order instead of encoding values in 
character ranges." in the code), but I didn't check the code thoroughly.


Best
Tomas





Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

2023-06-03 Thread Ben Bolker
  Thanks, I do know about the docs you quoted.  Thanks for pointing me 
to the comment in the code.


 I've posted an issue (a request to make the documentation match the 
code) at the TRE repository:


https://github.com/laurikari/tre/issues/88




Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

2023-06-16 Thread peter dalgaard
Just for amusement: similar mess-ups occur with Danish and its three extra 
letters:

> Sys.setlocale("LC_ALL", "da_DK")
[1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"
> sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"

> grepl("[A-Å]", "Ø")
[1] FALSE
> grepl("[A-Å]", "Æ")
[1] FALSE
> grepl("[A-Æ]", "Å")
[1] TRUE
> grepl("[A-Æ]", "Ø")
[1] FALSE
> grepl("[A-Ø]", "Å")
[1] TRUE
> grepl("[A-Ø]", "Æ")
[1] TRUE

So for character ranges, the order is Å, Æ, Ø (which is how they'd collate in 
Swedish, except that Swedish uses the letters with diacritics, Ä and Ö, rather 
than Æ and Ø).
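
The six results above are what one would get if ranges were compared by
Unicode code point rather than by collation order. A rough check, assuming
that is what the engine does; `in_range()` is a hypothetical helper written
for this message, not part of R or TRE, and the letters are written as \u
escapes so the snippet does not depend on the file encoding:

```r
AA <- "\u00c5"  # Å
AE <- "\u00c6"  # Æ
OE <- "\u00d8"  # Ø

# Code points: Å < Æ < Ø, unlike the Danish collation Æ, Ø, Å.
sapply(c(AA, AE, OE), utf8ToInt)

# Hypothetical helper: is `ch` within [lo-hi] when compared by code point?
in_range <- function(ch, lo, hi) {
  cp <- utf8ToInt(ch)
  utf8ToInt(lo) <= cp && cp <= utf8ToInt(hi)
}

in_range(OE, "A", AA)    # FALSE, as for grepl("[A-Å]", "Ø")
in_range(AE, "A", AA)    # FALSE
in_range(AA, "A", AE)    # TRUE
in_range(OE, "A", AE)    # FALSE
in_range(AA, "A", OE)    # TRUE
in_range(AE, "A", OE)    # TRUE
in_range("T", "A", "Z")  # TRUE, as in the Estonian example
```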

> Sys.setlocale("LC_ALL", "sv_SE")
[1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"
> sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"
> sort(c(LETTERS,"Ä","Ö","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"




-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

2023-06-16 Thread Ben Bolker

  Yes.
  FWIW I submitted a request for a documentation fix to TRE (to 
document that it actually uses Unicode code point order, not collation 
order, to define ranges, just like most, but not all, other regex engines):


https://github.com/laurikari/tre/issues/88





--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
> E-mail is sent at my convenience; I don't expect replies outside of 
working hours.

