Re: Unicode::Collate question

2003-12-01 Thread Eric Cholet
On 29 Nov 2003, at 16:30, Jarkko.Hietaniemi wrote:

>> I want to correctly sort words in a variety of languages, currently
>> French, English, Spanish, Portuguese, German and Arabic. I am using
>> Perl 5.8.1 and Unicode. I think I need Unicode::Collate to have
>> *correct* sorting. Is this correct?
>
> In addition to the problems listed by Sadahiro (most importantly that
> the UCA is not correct for any particular language, it is just a
> baseline ordering that is used for Unicode character data) I think it
> is worth pointing out that trying to sort multilingual data is
> practically doomed to fail sooner or later because many
> language-specific rules simply are contradictory.

Thank you both for your replies. What about sorting words in one
particular language, is Perl's sort() good enough? I'm wondering, since
language isn't one of sort()'s arguments.

--
Eric Cholet


Re: Unicode::Collate question

2003-12-01 Thread Jarkko Hietaniemi
> Thank you both for your replies. What about sorting words in one
> particular language, is Perl's sort() good enough? I'm wondering, since
> language isn't one of sort()'s arguments.

First we need to define good enough... again, if you are sorting
simple English or Hawaiian, you are probably fine.  But as soon
as your words contain real-life complications like
	- non-ASCII Latin-1 letters (the examples are garbled in this archive)
	- beyond-Latin-1 letters (the examples are likewise garbled)
	- people's names
	- acronyms and the like
	- whether all the characters matter or just the letters
	- sorting mixed letters and digits
	- Roman numerals

you are on your own.  For the first item the locale pragma can help
as long as your data is 8-bit and in one locale.  As soon as the data
becomes Unicode, Perl will, as far as I know, ignore the locale when
sorting.

If you find yourself wanting some complex sorting, look on CPAN: search
search.cpan.org for "sort" and see what turns up. Sort::ArbBiLex, for
example, might be useful.
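
To make the difference concrete, here is a minimal sketch comparing the built-in sort() with Unicode::Collate using its default (untailored) UCA table. It assumes Unicode::Collate is installed; the word list and variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains literal UTF-8 characters
use Unicode::Collate;

my @words = qw(zèbre Zoo école apple);

# Built-in sort compares code points: "Z" (0x5A) < "a" (0x61) < "z" (0x7A) < "é" (0xE9).
my @by_codepoint = sort @words;

# Unicode::Collate applies the default UCA ordering, which is a baseline
# and is not tailored to any particular language.
my $collator = Unicode::Collate->new;
my @by_uca   = $collator->sort(@words);

binmode(STDOUT, ":utf8");
print "code points: @by_codepoint\n";
print "UCA default: @by_uca\n";
```

With code point order the capitalised and accented words end up at the edges of the list (Zoo apple zèbre école), whereas the default UCA ordering should interleave them alphabetically (apple école zèbre Zoo). Neither result is guaranteed to match the rules of a specific language.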

--
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen




Re: Unicode::Collate question

2003-12-01 Thread Jarkko Hietaniemi
> Ok, this is in line with how I understood this paragraph in
> perluniintro:
>
>    The short answer is that by default, Perl compares strings (lt,
>    le, cmp, ge, gt) based only on the code points of the
>    characters.  In the above case, the answer is "after", since
>    0x00C1 > 0x00C0.
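
The quoted rule can be checked directly. The following is a small sketch (variable names are illustrative) showing that cmp looks only at code points, both for the two characters mentioned in perluniintro and for plain ASCII words.

```perl
use strict;
use warnings;

# U+00C1 (A with acute) and U+00C0 (A with grave), written as escapes.
my ($a_acute, $a_grave) = ("\x{C1}", "\x{C0}");

# cmp yields 1 because 0xC1 > 0xC0; no language rule is consulted.
my $order = $a_acute cmp $a_grave;
printf "U+%04X cmp U+%04X = %d\n", ord($a_acute), ord($a_grave), $order;

# The same rule puts every uppercase ASCII letter before every lowercase one.
my @sorted = sort qw(banana Cherry apple);
print "@sorted\n";
```

The second line of output is Cherry apple banana, which illustrates why code point order is only accidentally "alphabetical".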

> So is it just by chance that these French words are accurately sorted?

I think a qualified yes here is in order...

> % perl -Mutf8 -e 'binmode(STDOUT, ":utf8"); print join " ", sort qw(côte côté cote coté)'
> cote coté côte côté

Is this the famous French backwards accents rule in action?
(http://www-clips.imag.fr/geta/gilles.serasset/tri-du-francais.html)
(no, I don't speak French)

But in this case, with those particular words, I think ISO Latin 1 (none
of the characters are beyond ISO Latin 1) just happens to work right:
o < ô, and e < é.
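
For comparison, here is a minimal sketch of what a backwards accent comparison gives for the same four words. It assumes Unicode::Collate is installed and uses its backwards => 2 option, which compares level-2 (accent) weights from the end of the string; treating that as a stand-in for French ordering is an assumption of this example, and the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains literal UTF-8 characters
use Unicode::Collate;

my @words = qw(côte côté cote coté);

# Code point order, as in the one-liner above:
# o (0x6F) < ô (0xF4) and e (0x65) < é (0xE9).
my @by_codepoint = sort @words;

# backwards => 2 makes the accent (secondary) level compare right to left,
# so an accent near the end of a word outweighs one near the start.
my $backwards    = Unicode::Collate->new(backwards => 2);
my @by_backwards = $backwards->sort(@words);

binmode(STDOUT, ":utf8");
print "code points:       @by_codepoint\n";
print "backwards accents: @by_backwards\n";
```

If the module behaves as documented, the second line should read cote côte coté côté, which differs from the code point order cote coté côte côté. That suggests the one-liner's output reflects plain code point order rather than the backwards accents rule.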

Some more links (database related, since they have had to think about
these things for years already) that hopefully explain some of the
problems related to linguistic sorting:

http://www.engin.umich.edu/caen/wls/software/oracle/server.901/a90236/ch4.htm
http://developer.mimer.com/documentation/html_92/Mimer_SQL_Engine_DocSet/Mimer_Concepts14.html

> Thanks,
> --
> Eric Cholet

--
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen