On 13/08/2015 15:19, peter dalgaard wrote:
Yes, collation is a strange thing, and?

And remember that on some platforms (including yours) ICU is used, so LC_COLLATE is not particularly relevant (unless it is 'C'). See ?Comparisons and ?icuGetCollate.

E.g. on my Yosemite system in en_US.UTF-8

rank(c(x, y))
[1] 1.5 1.5
icuGetCollate()
[1] "root"
icuSetCollate(locale="ASCII")
rank(c(x, y))
[1] 2 1

whereas on Fedora 21

rank(c(x, y))
[1] 2 1
icuGetCollate()
[1] "root"
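
To see which of these mechanisms is in play on a given machine, something like the following sketch may help. It only reports the current state: capabilities("ICU") says whether this build of R has ICU at all, and icuGetCollate() names the collator in use. The variable names has_icu and collator are just for illustration, and the rank() result is deliberately not annotated because it differs between platforms, as the two transcripts above show.

```r
# Is this build of R compiled with ICU, and which collator is active?
has_icu <- unname(capabilities("ICU"))
collator <- if (has_icu) icuGetCollate() else "ICU not available"

cat("ICU available:", has_icu, "\n")
cat("ICU collator: ", collator, "\n")
cat("LC_COLLATE:   ", Sys.getlocale("LC_COLLATE"), "\n")

# The two strings from this thread; the result depends on the collator in use.
x <- "\u0663"
y <- "3"
print(rank(c(x, y)))
```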




Collation order will depend on locale settings, and there are quite a few cases 
where the collation order of two items is not defined.

To add to the confusion, on OSX Mavericks, I see

x <- "\u0663"
y <- 3

x == y
[1] FALSE
rank(c(x, y))
[1] 2 1
x
[1] "٣"
x == y
[1] FALSE
x > y
[1] TRUE
x < y
[1] FALSE

Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"

Notice the differences from en_US.UTF8 (sans hyphen) on your system....

-pd

On 13 Aug 2015, at 16:01 , John McKown <john.archie.mck...@gmail.com> wrote:

2015-08-13 8:39 GMT-05:00 Hadley Wickham <h.wick...@gmail.com>:

x <- "\u0663"
y <- 3

x == y
# FALSE
rank(c(x, y))
# c(1.5, 1.5)


Also interesting, and confusing to me:

x == y
[1] FALSE
x > y
[1] FALSE
x < y
[1] FALSE
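
A small sketch of what is probably going on in the three comparisons above (an interpretation, not something stated in the thread): y is numeric, so both c() and the comparison operators first coerce it to the string "3"; == then tests whether the two strings are the same, while < and > ask the collator. If the collator treats the two digits as equivalent, all three can be FALSE at once. Only the lines with an annotated result are platform-independent.

```r
x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- 3          # numeric, coerced to "3" when combined or compared with x

# c() coerces the number to a character string.
identical(c(x, y), c(x, "3"))   # TRUE

# The two strings are different characters with different code points.
utf8ToInt(x)     # 1635
utf8ToInt("3")   # 51

# == compares string contents, so it is FALSE whatever the locale ...
x == y           # FALSE

# ... whereas < and > use the collating sequence, so these results
# depend on the locale and on whether ICU is in use.
c(lt = x < y, gt = x > y)
```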


With some slight changes:

x <- "\u0663"
y <- "3"
xy <- c(x,y)
rank(xy);
[1] 1.5 1.5
Sys.getlocale();
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
Sys.setlocale(category="LC_COLLATE", locale="C");
[1] "C"
rank(xy);
[1] 2 1
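
If the aim is a ranking that does not depend on the session's collation settings, one option is to switch LC_COLLATE to "C" only for the duration of the call and restore it afterwards. The helper name rank_in_c_locale below is hypothetical (it is not part of R or of this thread); it is a minimal sketch built from the same Sys.setlocale() call shown above.

```r
# Hypothetical helper: rank a character vector under the "C" collation,
# then restore whatever LC_COLLATE setting was in effect before.
rank_in_c_locale <- function(v) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  rank(v)
}

xy <- c("\u0663", "3")
rank_in_c_locale(xy)   # 2 1, matching the transcript above
```

The on.exit() call means the previous setting is restored even if rank() fails, so the rest of the session keeps its usual sort order.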




--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
