> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> On Behalf Of Ted Harding
> Sent: Friday, May 28, 2010 1:15 PM
> To: r-help@r-project.org
> Cc: carslaw
> Subject: Re: [R] difference in sort order linux/Windows (R.2.11.0)
>
> On 28-May-10 14:37:39, Duncan Murdoch wrote:
> > On 28/05/2010 9:24 AM, (Ted Harding) wrote:
> >> An experiment:
> >>
> >> sort(c("AACD","A CD"))
> >> # [1] "AACD" "A CD"
> >>
> >> sort(c("ABCD","A CD"))
> >> # [1] "ABCD" "A CD"
> >>
> >> sort(c("ACCD","A CD"))
> >> # [1] "ACCD" "A CD"
> >>
> >> sort(c("ADCD","A CD"))
> >> # [1] "A CD" "ADCD"
> >>
> >> sort(c("AECD","A CD"))
> >> # [1] "A CD" "AECD"
> >> ## (with results for "AFCD", ... "AZCD" similar to the last two).
> >>
> >> LC_COLLATE=en_GB.UTF-8
> >>
> >> (R version 2.11.0 (2010-04-22) on Linux).
> >>
> >> So this behaves, in en_GB.UTF-8, as though " " (SPACE) is between
> >> "C" and "D".
> >>
> >> This is nuts!!!
> >>
> >> Curable if I set (e.g.) LC_LOCALE="C" on startup. But what else
> >> might break if I do so?
> >>
> >
> > You have to realize that to a large extent this is not under our
> > control. Your system will have linked to some library (outside of R)
> > to do string collation, and the problem lies in that library. You
> > should determine which system library is handling your collations.
> >
> > I'd like to tell you how to do that, but I don't know for your build.
> > You can find out if you're using the recommended ICU library by
> > running example(icuSetCollate); that gives a number of warnings like
> >
> > In icuSetCollate(locale = "da_DK", case_first = "default") :
> > ICU is not supported on this build
> >
> > in Windows. If you don't see those, then you want to talk to the ICU
> > people. If you do, then you'll need to look deeper to find out what
> > you're actually using.
> >
> > Duncan Murdoch
>
> Thanks for the further guidance, Duncan. I indeed get 4 such warnings
> from example(icuSetCollate), indicating that ICU is not being used.
>
> I have now thrown the above experiment straight at Linux, entering
> command-line commands as follows (with the results shown on the
> lines starting with "#"):
>
> sort << EOT
> "AACD"
> "A CD"
> EOT
> # "AACD"
> # "A CD"
>
> sort << EOT
> "ABCD"
> "A CD"
> EOT
> # "ABCD"
> # "A CD"
>
> sort << EOT
> "ACCD"
> "A CD"
> EOT
> # "ACCD"
> # "A CD"
>
> sort << EOT
> "ADCD"
> "A CD"
> EOT
> # "A CD"
> # "ADCD"
>
> This clearly shows that the Linux collating order sees " " (SPACE)
> as coming between "C" and "D", as when I tried it in R.
>
> I am now spamming my Linux contacts about it!
>
> The result of the "locale" command in Linux includes:
> LC_COLLATE="en_GB.UTF-8"
>
> This happens consistently on a Debian Lenny and a Debian Etch system.
>
> Thanks,
> Ted.
>
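
One possible reading of the results quoted above, offered as an assumption rather than a diagnosis: the system's UTF-8 locales may ignore spaces on the first comparison pass, so "A CD" is effectively compared as "ACD". That would sort after "ACCD" and before "ADCD", which matches the pattern without SPACE actually sitting between "C" and "D". To compare against plain byte order without changing the locale for the whole session, a minimal sketch is to set LC_ALL=C for a single sort call. The name c_sort is a hypothetical helper introduced here only for illustration.

```shell
# c_sort: hypothetical helper that sorts its arguments in byte order.
# LC_ALL=C applies only to this one sort invocation, so the rest of the
# session keeps its usual locale (e.g. en_GB.UTF-8).
c_sort() {
  printf '%s\n' "$@" | LC_ALL=C sort
}

# In the C locale SPACE (0x20) sorts before every letter, so "A CD"
# should come first in both cases.
c_sort "AACD" "A CD"
c_sort "ADCD" "A CD"
```

Inside R, the rough equivalent would be calling Sys.setlocale("LC_COLLATE", "C") before sort(), with the caveat that this changes collation for the remainder of the session.
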
Maybe asking on R-sig-Debian could be of some help.

https://stat.ethz.ch/mailman/listinfo/r-sig-debian

Hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.