Re: Unicode::Collate question
On 29 Nov 03, at 16:30, Jarkko.Hietaniemi wrote:

>> I want to correctly sort words in a variety of languages, currently French, English, Spanish, Portuguese, German and Arabic. I am using Perl 5.8.1 and Unicode. I think I need Unicode::Collate to have *correct* sorting. Is this correct?
>
> In addition to the problems listed by Sadahiro (most importantly that the UCA is not correct for any particular language; it is just a baseline ordering that is used for Unicode character data), I think it is worth pointing out that trying to sort multilingual data is practically doomed to fail sooner or later, because many language-specific rules simply contradict each other.

Thank you both for your replies. What about sorting words in one particular language, is Perl's sort() good enough? I'm wondering, since language isn't one of sort()'s arguments.

-- Eric Cholet
Re: Unicode::Collate question
> Thank you both for your replies. What about sorting words in one particular language, is Perl's sort() good enough? I'm wondering, since language isn't one of sort()'s arguments.

First we need to define "good enough". Again, if you are sorting simple English or Hawaiian, you are probably fine. But as soon as your words contain real-life complications like

- letters like or or or or ...
- beyond-Latin-1 letters like or or or or or or ...
- people's names
- acronyms and the like
- do all the characters matter or just the letters
- sorting mixed letters and digits
- Roman numerals

you are on your own.

For the first item, the locale pragma can help as long as your data is 8-bit and in one locale. As soon as the data becomes Unicode, Perl will, as far as I know, ignore the locale when sorting.

If you find yourself wanting some complex sorting, look on CPAN: search for "sort" on search.cpan.org. Sort::ArbBiLex, for example, might be useful.

-- Jarkko Hietaniemi http://www.iki.fi/jhi/
There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen
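To make the difference concrete, here is a minimal sketch comparing Perl's built-in sort (plain code point order) with Unicode::Collate's default ordering. The word list is made up for illustration and is not from the thread, and the collator result is only the language-neutral UCA baseline, not a French-specific ordering.

```perl
use strict;
use warnings;
use utf8;                                 # the literals below are UTF-8
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample data, chosen to mix case and accents.
my @words = qw(zèbre abricot école Zurich eau);

# Built-in sort compares code points: 'Z' (0x5A) sorts before every
# lowercase letter, and 'é' (0xE9) sorts after 'z' (0x7A).
my @by_codepoint = sort @words;

# Unicode::Collate with no options applies the default UCA table.
my $collator = Unicode::Collate->new;
my @by_uca   = $collator->sort(@words);

print "code point: @by_codepoint\n";
print "UCA:        @by_uca\n";
```

With this list, the code point order puts "Zurich" first and "école" last, while the collator groups words by base letter regardless of case or accent.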
Re: Unicode::Collate question
> Ok, this is in line with how I understood this paragraph in perluniintro:
>
>   The short answer is that by default, Perl compares strings (lt, le, cmp, ge, gt) based only on the code points of the characters. In the above case, the answer is "after", since 0x00C1 > 0x00C0.
>
> So is it just by chance that these French words are accurately sorted?

I think a qualified yes here is in order...

> % perl -Mutf8 -e 'binmode(STDOUT, ":utf8"); print join " ", sort qw(côte côté cote coté)'
> cote coté côte côté
>
> Is this the famous French backwards accents rule in action?
> (http://www-clips.imag.fr/geta/gilles.serasset/tri-du-francais.html)

(No, I don't speak French.) But in this case, with those particular words, I think ISO Latin 1 (none of the characters are beyond ISO Latin 1) just happens to work right: o < ô, and e < é.

Some more links (database-related, since they have had to think about these things for years already) that hopefully explain some of the problems related to linguistic sorting:

http://www.engin.umich.edu/caen/wls/software/oracle/server.901/a90236/ch4.htm
http://developer.mimer.com/documentation/html_92/Mimer_SQL_Engine_DocSet/Mimer_Concepts14.html

> Thanks,
> -- Eric Cholet

-- Jarkko Hietaniemi http://www.iki.fi/jhi/
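For anyone who wants to check the four words from the one-liner against a collator, the sketch below sorts them three ways: by code point, with Unicode::Collate's defaults, and with Unicode::Collate's `backwards => 2` option, which reverses the comparison of accent (level 2) weights. Treating `backwards => 2` as a stand-in for the French rule on the linked page is an assumption here, not something the thread confirms.

```perl
use strict;
use warnings;
use utf8;                                 # the literals below are UTF-8
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my @words = qw(côte côté cote coté);

# The perluniintro case: cmp compares code points, so
# U+00C1 (A acute) sorts after U+00C0 (A grave).
printf "C1 cmp C0: %d\n", "\x{C1}" cmp "\x{C0}";

# 1. Built-in sort: code point order (o < ô, e < é).
my @by_codepoint = sort @words;

# 2. Default UCA: accents are compared left to right.
my @forward = Unicode::Collate->new->sort(@words);

# 3. Accent weights compared from the end of the word
#    (assumed here to approximate the French rule).
my @french = Unicode::Collate->new(backwards => 2)->sort(@words);

print "code point: @by_codepoint\n";
print "UCA:        @forward\n";
print "backwards:  @french\n";
```

For these four words the code point order and the default UCA order coincide (cote coté côte côté), while the backwards option yields cote côte coté côté, so it is worth checking which of the two a given application actually expects.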