Re: [R] a question of alphabetical order

2008-04-16 Thread Hans-Joerg Bibiko
Hi,

as already mentioned, sorting can be a pain.

My solution is to write my own order routine for a given language.
The idea is to transform the UTF-8 string into ASCII in such a way
that the built-in order routine outputs the desired result. But this
can be a very rocky road.

Example for Spanish (please correct me if I'm wrong):
-accents are ignored
-ll is one single entity and comes after l (ludar comes before llave)
-ch is one single entity and comes after c

The only thing I do not know is whether a 'll' can ever be two
entities rather than one (maybe as the result of combining two nouns).
If so, then the entire story will be much more complicated.

Now the big question is how to delete all these accents in åàÿñü etc.
to get aaynu. (Technically speaking: apply a Unicode decomposition such
as NFKD, then drop the combining marks.)
One possible way is to use a scripting language which can handle it.
The only language I know which can do it by default is Python. For
Ruby and Perl one has to install an additional library.

On a Mac system Python is installed by default; on Windows it is not. If
this ordering is also an issue for Windows users, then they have to
install it beforehand.

The code comes here:

orderES <- function(x) {
 #decomposes all accented characters
 str <- NKFD(x)

 #all combining diacritics (U+0300 to U+036F)
 nonChars <- c(768:879)
 pattern <- paste("[", intToUtf8(as.integer(nonChars)), "]", sep="")

 #delete all combining diacritics
 str <- gsub(pattern, "", str)

 #transform ll and ch to l{ and c{ ({ comes after z)
 str <- gsub("ll", "l{", gsub("ch", "c{", str))
 order(str)
}

NKFD <- function(x) {
 #pipes a small Python 2 script to python; it prints each string in NFKD form
 system(paste("echo -en '# coding=utf-8\nimport unicodedata\nfor i,v in enumerate([\"",
              paste(x, collapse="\", \""),
              "\"]):print unicodedata.normalize(\"NFKD\",unicode(v, \"UTF-8\")).encode(\"UTF-8\")'|python -",
              sep=""), intern=TRUE)
}

Notes on the NKFD() routine:
- it only works if R's environment is set to UTF-8!
- some letters are not decomposed, for instance a Danish ø will not become
  o plus a slash (such cases have to be handled manually)
- it is not very fast
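
The same idea (build an ASCII key, then order on the key) can be sketched in base R without calling Python. This is only a sketch under assumptions: the helper names spanish_key() and order_es() are made up here, the accent table covers just the Spanish accented vowels and ü rather than full NFKD, and keeping ñ as a separate letter after n is a choice made for this example.

```r
# Hypothetical helpers (not from the thread): build an ASCII sort key for
# Spanish words using only base R, then order on that key.
spanish_key <- function(x, digraphs = FALSE) {
  # Fold accented vowels and u-diaeresis (both cases) onto their base letters.
  key <- chartr("\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00c1\u00c9\u00cd\u00d3\u00da\u00dc",
                "aeiouuAEIOUU", x)
  # Assumption: n-tilde stays a letter of its own after n ("{" sorts after "z").
  key <- gsub("\u00f1", "n{", key, fixed = TRUE)
  key <- gsub("\u00d1", "N{", key, fixed = TRUE)
  key <- tolower(key)
  if (digraphs) {
    # Older convention: ch and ll count as single letters after c and l.
    key <- gsub("ch", "c{", key, fixed = TRUE)
    key <- gsub("ll", "l{", key, fixed = TRUE)
  }
  key
}

order_es <- function(x, digraphs = FALSE) {
  # method = "radix" compares in the C locale, so the result does not
  # depend on the platform's collation tables.
  order(spanish_key(x, digraphs), method = "radix")
}

medio <- c("avi\u00f3n", "barco", "bicicleta", "\u00e1ngulo",
           "cami\u00f3n", "coche", "tren")
print(medio[order_es(medio)])
```

With digraphs = FALSE (the default) ll and ch are compared letter by letter; digraphs = TRUE reproduces the l{ and c{ trick from orderES above.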


Cheers,

--Hans

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] a question of alphabetical order

2008-04-16 Thread [Ricardo Rodriguez] Your XEN ICT Team
Thanks Hans!

Hans-Joerg Bibiko wrote:
> Example for Spanish (please correct me if I'm wrong):
> -accents are ignored

Correct.

> -ll is one single entity and comes after l (ludar comes before llave)

No. Nowadays ll is treated as two separate letters (two entities, in your
terms). Here is an example of the resulting order:

lama, lazo, leve, llave, lluvia, ludar (if such a word exists), luna

> -ch is one single entity and comes after c

No, the same applies. Another example:

capa, casco, chapa, chepa, cisma, copa, curva

Here is the original source (in Spanish):

http://tinyurl.com/ysm243
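
Since ll and ch are compared letter by letter under this rule, a plain ASCII comparison already reproduces both example orders. A small check, assuming nothing beyond base R (the variable names are only for illustration):

```r
# Words from the two examples above, deliberately shuffled.
l_words <- c("luna", "llave", "lama", "ludar", "leve", "lluvia", "lazo")
c_words <- c("curva", "chepa", "capa", "copa", "chapa", "cisma", "casco")

# method = "radix" compares strings in the C locale, i.e. character by
# character for plain ASCII input, with no special handling of "ll" or "ch".
sorted_l <- sort(l_words, method = "radix")
sorted_c <- sort(c_words, method = "radix")
print(sorted_l)
print(sorted_c)
```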


> The only thing I do not know if it could happen that a 'll' is not one
> entity but two (maybe the result of the combination of two nouns). If
> so then the entire story will be much more complicated.

I think this is the case for ll, ch and even rr, although rr was never
considered a single entity.

What I don't know is how the Spanish locales handle these rules. I'll try
to find out and keep this thread (or a follow-up on the r-sig-mac list)
updated.

I don't know if this applies only to Mac or is a general issue. In any 
case, as I am working with Mac now, I will move the discussion to the 
r-sig-mac list as proposed by Brian Ripley. Do you agree?

See you there!

Greetings,

Ricardo


-- 
Ricardo Rodríguez
Your XEN ICT Team



Re: [R] a question of alphabetical order

2008-04-16 Thread [Ricardo Rodriguez] Your XEN ICT Team
Hans-Joerg Bibiko wrote:
> Hola,
>
> Muchas gracias!
> This is new to me. I learnt Spanish a bit - well - 20 years ago ;)
> But this simplifies it.

This change happened just 14 years ago, so you are not to blame!

> Recuerdos
>
> Hans


Saludos cordiales! Read you in Spanish whenever you want!

Ricardo

-- 
Ricardo Rodríguez
Your XEN ICT Team



[R] a question of alphabetical order

2008-04-15 Thread [Ricardo Rodriguez] Your XEN ICT Team
Hi all,

In Spanish, accented vowels such as á, é, ... do not affect the
alphabetical order of a vector of strings. I mean, a and á are treated
alike when establishing the alphabetical order.

Nevertheless, while working with R order, here is what I get.

Given a file transport.txt

medio#variable
avión#34
barco#33
bicicleta#3
ángulo#37
camión#54
coche#23
tren#67

> toPlot <- read.csv("~/Desktop/Workplace/transport.txt", header=TRUE, sep="#")
> toPlot[order(toPlot$medio),]
      medio variable
1     avión       34
2     barco       33
3 bicicleta        3
5    camión       54
6     coche       23
7      tren       67
4    ángulo       37
 

I expected ángulo to appear in first place, since n (in ángulo) comes before
v (in avión) and á/a does not matter for alphabetical order.

But ángulo appears in the last position.

Here my environment:

> sessionInfo()
R version 2.7.0 beta (2008-04-12 r45280)
i386-apple-darwin9.2.2

locale:
es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
> version
               _
platform       i386-apple-darwin9.2.2
arch           i386
os             darwin9.2.2
system         i386, darwin9.2.2
status         beta
major          2
minor          7.0
year           2008
month          04
day            12
svn rev        45280
language       R
version.string R version 2.7.0 beta (2008-04-12 r45280)
 

Is it not possible to get this dataframe ordered correctly in Spanish? 
Other programs (Excel, for instance) do order correctly.

Thanks for your help,

Ricardo

-- 
Ricardo Rodríguez
Your XEN ICT Team



Re: [R] a question of alphabetical order

2008-04-15 Thread Prof Brian Ripley
This is a known Mac OS X bug, nothing to do with R which uses the system 
functions (strcoll/wcscoll) for such things.


If you look at the help for sort, it refers you to ?Comparison.  Which 
says


 Comparison of strings in character vectors is lexicographic within
 the strings using the collating sequence of the locale in use: see
 'locales'.  The collating sequence of locales such as 'en_US' is
 normally different from 'C' (which should use ASCII) and can be
 surprising.  Beware of making _any_ assumptions about the
 collation order: e.g. in Estonian 'Z' comes between 'S' and 'T',
 and collation is not necessarily character-by-character - in
 Danish 'aa' sorts as a single letter, after 'z'.  Some platforms
 may not respect the locale and always sort in ASCII.  (String
 comparison is always for the part of the string up to the first
 nul if there are embedded nuls.)

Mac OS X (more specifically, 10.5.2 on i386) is one of those disrespectful 
platforms.



> x <- intToUtf8(c(32:127, 160:255), multiple=T)
> order(x)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
[163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
[181] 181 182 183 184 185 186 187 188 189 190 191 192

which is quite different from Linux or Solaris.  This may not come out,
but paste(sort(x), collapse="") includes


aAªáÁàÀâÂåÅäÄãÃæÆbBcCçÇdDeEéÉèÈêÊëË

on Linux in es_ES.utf8 .

Platforms are a lot worse at sorting in UTF-8 than 8-bit encodings.  Mac 
OS X has es_ES.ISO8859-15, and that does do a reasonable job including 
aáàâåäãæ .
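
The difference between locale-aware and plain byte-order sorting can be reproduced on any platform. In the sketch below (variable names are only for illustration) the default method follows the platform's collation for the current locale, so its result is left unchecked, while method = "radix" always compares in the C locale and therefore shows the byte-order behaviour described above:

```r
# The collation locale that sort()/order() use by default.
print(Sys.getlocale("LC_COLLATE"))

x <- c("avi\u00f3n", "\u00e1ngulo", "barco")   # avión, ángulo, barco

# Default method: relies on the platform's collation for the current
# locale, so the result can differ between systems.
locale_order <- order(x)
print(x[locale_order])

# method = "radix": C-locale (byte-wise) comparison everywhere. The first
# byte of the accented letter is larger than any ASCII letter, so ángulo
# ends up last.
c_order <- order(x, method = "radix")
print(x[c_order])
```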




--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: [R] a question of alphabetical order

2008-04-15 Thread [Ricardo Rodriguez] Your XEN ICT Team

Tricky question, this order issue :-(

Thank you so much for the detailed explanation.

So must I conclude that I will have to live with this ASCII order while
working on Mac OS X 10.5.2 until the Mac people fix this bug?

You mentioned es_ES.ISO8859-15 on the Mac. Will it do the trick? Yes, as
far as I understand. But as I am using R.app, the locale is set by the
system preferences. Honestly, I am rather confused by this issue.

Could I force es_ES.ISO8859-15 as the locale on the Mac?

Sorry if I add another question here... why does Excel order the list
correctly? I guess it doesn't rely on the Mac settings.

As an R newbie I must admit that this and other behaviours are
really hard to deal with. But I've seen, and even done, such an amount of
wonderful things with R that it is worth all the effort. Thanks for your help.

All the best,

Ricardo


-- 
Ricardo Rodríguez
Your XEN ICT Team



Re: [R] a question of alphabetical order

2008-04-15 Thread [Ricardo Rodriguez] Your XEN ICT Team
Almost done...

Sys.setlocale(category = "LC_ALL", locale = "es_ES.ISO8859-15")

The order is now correct, but most of the non-ASCII characters are
rendered incorrectly, both in the console:

1 √\201guilas de mantenimiento 1.97 NA 1.72
2 √\201ngeles de la CONAGUA 1.77 1.97 1.94

And in quartz():

http://mire.environmentalchange.net/~webmaster/images/MexRenderErrors.png
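
One plausible reading of the garbled console output (an assumption, not something confirmed in this thread) is that the data are still UTF-8 bytes while the session now treats every byte as a separate 8-bit character. The sketch below only shows the byte-level facts behind that reading: Á occupies two bytes in UTF-8, and the "\201" in the output is the octal form of the second one.

```r
a_acute <- "\u00c1"                  # Á, held by R as a UTF-8 string
utf8_bytes <- charToRaw(a_acute)     # its two bytes: c3 81
print(utf8_bytes)

# "\201" is octal notation; as a number it equals the second byte (0x81).
second_byte <- strtoi("201", base = 8L)
print(second_byte == as.integer(utf8_bytes[2]))
```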

Well, the solution seems to be to do the ordering under one locale and to
create the output under the other. Is this possible?

Thanks!

Ricardo

-- 
Ricardo Rodríguez
Your XEN ICT Team



Re: [R] a question of alphabetical order

2008-04-15 Thread Prof Brian Ripley

On Wed, 16 Apr 2008, [Ricardo Rodriguez] Your XEN ICT Team wrote:


> Well, the solution seems to be to set order with a locale, and to create the
> output with the other, is this possible?


Yes, but only for the same character set.  I believe R.app assumes UTF-8, 
and I would not expect to be able to change charset on a running console.
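
Switching only the collation category for the duration of one call could look like the sketch below. with_collate() is a hypothetical helper, not part of R. The demonstration uses the "C" locale because it exists on every platform. On the Mac one might try "es_ES.ISO8859-15" instead, subject to the same-character-set restriction just mentioned, so the strings being ordered would have to be in that encoding.

```r
# Hypothetical helper (not part of R): evaluate an expression under a
# different LC_COLLATE, then restore the previous setting.
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  force(expr)   # expr is evaluated lazily, i.e. only after the switch
}

before <- Sys.getlocale("LC_COLLATE")
# In the C locale upper-case ASCII letters sort before lower-case ones.
idx <- with_collate("C", order(c("b", "A", "a")))
print(idx)
after <- Sys.getlocale("LC_COLLATE")
```

Only LC_COLLATE is touched, so LC_CTYPE (and with it the way the console displays characters) stays as it was.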


Please do use R-sig-mac for MacOS-specific issues.






--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595