-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Simon Slavin wrote:
> Your descriptions make perfect sense and are very  
> interesting since ICU is a good attempt to get around one of the  
> fundamental problems of Unicode.

Errr, this is not the fault of Unicode.  It is the fault of people!  Unicode
lets you represent the majority of the world's past and present characters
using the same character set.  Note that there is a lot of debate over
exactly what constitutes a character, how code points combine, the same code
point being used for different native character sets, and how to deal with
older text where the character depiction matters even if it is the "same" as
a modern character.  Unicode is a reasonable compromise.  See
http://en.wikipedia.org/wiki/Unicode#Issues
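
As a small illustration of the "what is a character" question, here is a
minimal Python sketch using only the standard unicodedata module.  The
variable names are mine and purely illustrative.  It shows the same visible
word spelled with two different code point sequences, which compare unequal
until both are normalized to the same form.

```python
import unicodedata

# The same visible word, spelled two ways at the code point level.
precomposed = "caf\u00e9"   # U+00E9, a single code point for e-acute
decomposed = "cafe\u0301"   # plain "e" followed by U+0301 COMBINING ACUTE ACCENT

# A naive comparison sees two different strings of different lengths.
same_raw = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# After normalizing both to NFC the code point sequences match.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

print(same_raw, lengths, same_nfc)
```

Normalization only settles which code point sequence to use.  It says
nothing about ordering, which is a separate problem.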

Sorting and comparing strings are hard.  For example someone in the US or UK
would consider cafe and café to be equivalent.  German has a different
ordering for looking in a phonebook versus a dictionary.  What do you do
about a German user having a Swedish name in their phonebook?  Is it sorted
using Swedish rules or German rules?  Unicode is not required to sort and
compare strings, but it is a much nicer place to start.  The folks at the
Unicode Consortium, who have been thinking about this for a very long time,
have come up with an algorithm that works (with locale-specific adjustments)
called the Unicode Collation Algorithm (UCA).  Their report gives you a good
idea of the complexity and issues involved.  Section 1.8 is enlightening.

 http://www.unicode.org/unicode/reports/tr10/
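
To see why plain code point ordering is not good enough, here is a rough
Python sketch.  It is emphatically NOT the UCA and not what ICU does: it
only mimics the idea of comparing base letters first and accents second,
with no locale tables at all.  The helper names primary_key and
rough_sort_key are hypothetical and exist only for this illustration.

```python
import unicodedata

def primary_key(s):
    # Decompose to NFD, drop combining marks, and casefold, so that
    # "cafe" and "café" compare equal at this first (base letter) level.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def rough_sort_key(s):
    # Base letters decide first; the decomposed original breaks ties,
    # so the unaccented spelling sorts just before the accented one.
    return (primary_key(s), unicodedata.normalize("NFD", s))

words = ["caff", "café", "cafe"]

# Code point order puts "café" last because U+00E9 is greater than "f".
codepoint_order = sorted(words)

# The two-level key keeps "cafe" and "café" next to each other.
rough_order = sorted(words, key=rough_sort_key)

print(codepoint_order)
print(rough_order)
```

A real implementation also has to handle the phonebook versus dictionary
case, per-locale tailoring and much more, which is exactly what the tables
in the UCA report are for.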

ICU is a programming library implementing UCA plus a few other things.  It
is large and slow because of people: it needs all sorts of built-in tables,
such as how each locale sorts things like accents and combining characters,
as well as ordinary code points commonly used across multiple locales:

  http://en.wikipedia.org/wiki/International_Components_for_Unicode

You likely didn't intend your comment to be taken as condescending towards
Unicode/UCA/ICU but I did want to make it *very* clear that they make life
considerably easier for us as programmers dealing with human text and
provide solutions to collation, case handling, etc. that we frequently need.
It is far more than a "good attempt", closer to a very good solution.  There
aren't any alternatives that come *remotely* close, as working through the
examples in the UCA report will show you.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkq0P0QACgkQmOOfHg372QSz9ACggmw5kaLKwL90nggbr0GaTxkZ
SNMAn17gWLmy3SdbzZVMI6fSoUtTVmYS
=jOGK
-----END PGP SIGNATURE-----
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
