On 13/08/2015 15:19, peter dalgaard wrote:
Yes, collation is a strange thing, and?
And remember that on some platforms (including yours) ICU is used, so
LC_COLLATE is not particularly relevant (unless it is 'C'). See
?Comparisons and ?icuGetCollate.
E.g. on my Yosemite system in en_US.UTF-8
rank(c(x, y))
[1] 1.5 1.5
icuGetCollate()
[1] "root"
icuSetCollate(locale="ASCII")
rank(c(x, y))
[1] 2 1
whereas on Fedora 21
rank(c(x, y))
[1] 2 1
icuGetCollate()
[1] "root"
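A minimal sketch of the check above, assuming an R build that reports `capabilities("ICU")`. The helper name `rank_under_icu_ascii` is hypothetical and not part of R. It ranks the strings under whatever collator is currently active, then again after `icuSetCollate(locale = "ASCII")`, and finally asks ICU to return to its default locale.

```r
# Hypothetical helper (not part of base R): compare ranks under the current
# collator and under ICU's "ASCII" setting.
rank_under_icu_ascii <- function(strings) {
  if (!capabilities("ICU")[[1]]) {
    # Without ICU support there is nothing to switch, so report nothing.
    return(NULL)
  }
  collator <- icuGetCollate()
  current <- rank(strings)
  icuSetCollate(locale = "ASCII")
  # Assumption: "default" re-derives the ICU locale from the environment,
  # which may not exactly match a collator configured earlier in the session.
  on.exit(icuSetCollate(locale = "default"), add = TRUE)
  list(collator = collator, current = current, ascii = rank(strings))
}

res <- rank_under_icu_ascii(c("\u0663", "3"))
print(res)
```

The `current` element is platform dependent, as the two systems above show; only the `ascii` element should be stable across machines.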
Collation order will depend on locale settings, and there are quite a few cases
where the collation order of two items is not defined.
To add to the confusion, on OS X Mavericks, I see
x <- "\u0663"
y <- 3
x == y
[1] FALSE
rank(c(x, y))
[1] 2 1
x
[1] "٣"
x == y
[1] FALSE
x > y
[1] TRUE
x < y
[1] FALSE
Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"
Notice the differences from en_US.UTF8 (sans hyphen) on your system.
-pd
On 13 Aug 2015, at 16:01 , John McKown <john.archie.mck...@gmail.com> wrote:
2015-08-13 8:39 GMT-05:00 Hadley Wickham <h.wick...@gmail.com>:
x <- "\u0663"
y <- 3
x == y
# FALSE
rank(c(x, y))
# c(1.5, 1.5)
Also interesting, and confusing to me:
x == y
[1] FALSE
x > y
[1] FALSE
x < y
[1] FALSE
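One possible reading of the three FALSE results, offered as an assumption about R's behaviour rather than something stated in this thread: `==` on character vectors asks whether the two strings are the same, while `<` and `>` consult the collator, which may treat two different strings as tied. A minimal sketch, using a hypothetical helper name `collation_report`:

```r
# Hypothetical helper (not part of base R): summarise how two strings
# compare under the collation currently in use.
collation_report <- function(a, b) {
  a <- as.character(a)  # mirror the coercion R applies to a numeric 3
  b <- as.character(b)
  less <- a < b
  greater <- a > b
  list(
    equal   = a == b,
    less    = less,
    greater = greater,
    # "tied": different strings that the collator does not order
    tied    = !less && !greater && a != b
  )
}

report <- collation_report("\u0663", 3)
print(report)
```

Whether `tied` comes out TRUE depends on the platform and locale, which is consistent with the differing results reported in this thread.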
With some slight changes:
x <- "\u0663"
y <- "3"
xy <- c(x,y)
rank(xy);
[1] 1.5 1.5
Sys.getlocale();
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
Sys.setlocale(category="LC_COLLATE", locale="C");
[1] "C"
rank(xy);
[1] 2 1
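The locale switch above changes collation for the rest of the session. A minimal sketch that confines it to a single call, assuming a hypothetical helper name `rank_in_c_collation`; it sets `LC_COLLATE` to "C", ranks, and restores the previous setting on exit.

```r
# Hypothetical helper (not part of base R): rank strings under the "C"
# collation without leaving the session's LC_COLLATE changed.
rank_in_c_collation <- function(strings) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the previous collation locale even if rank() fails.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  rank(strings)
}

c_ranks <- rank_in_c_collation(c("\u0663", "3"))
print(c_ranks)
```

For these two strings the transcript above reports ranks 2 and 1 under "C".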
--
Brian D. Ripley, rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel