> > I can't really believe that this would be a problem, but if they're
> > integrated alphabets from different locales, will there be issues
> > with sorting (if we're not planning to use the locale)? Are there
> > instances where like characters were combined that will affect the
> > sort orders?
> 
> Yes, it is an issue.  In the general case, you CANNOT sort strings of
> several locales/languages into a single order that would satisfy all
> of the locales/languages.  One often quoted example is German and
> Swedish/Finnish: the LATIN CAPITAL LETTER A WITH RING ABOVE comes
> between A and B in the former but after Z (not immediately, but
> doesn't matter here) in the latter.  Similarly for all the accented
> alphabetic characters, the rules for how they are sorted differ from
> one place to another, and many languages have special combinations like
> ch, ss, ij that require special attention.

My understanding is that there is NO general Unicode sort order, period.

The most useful sort has to be locale-sensitive, as defined by Unicode
collation. In practice, the story is even worse. For example, how do
you sort strings coming from different locales? Say I have an address
book with names from all over the world: which locale should I use
to sort the names? Another example: Chinese has no definitive
sorting order, period. The commonly used schemes are phonetic-based
or stroke-based, but many characters have more than one pronunciation
(context-sensitive) and more than one form (simplified and traditional).
So if we have mixed content from China and Taiwan, it is impossible to
sort it in a way that makes everyone happy. Also, Chinese is
space-insensitive. In English, we have to use spaces to separate words.
But Chinese has no lexical words, only linguistic words: you can insert
a space between any two Chinese characters without changing their
meaning.

I heard a rumor a long time ago that the Unicode Consortium was working
on a locale-independent collation that could be used to sort mixed
content. As for Perl, I would like to have several basic sorts:
a) binary sort
b) locale-independent general sort
c) locale-sensitive sort based on Unicode collation

We could have more if possible. The general sort can be done by
canonicalizing all strings, removing case information, removing
diacritics, removing font/width distinctions, and then using a
binary sort.
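
To make the general sort idea concrete, here is a rough sketch in Python
(used only because its standard library ships the Unicode tables; this is
not a proposal for the Perl API). The names general_sort_key and
general_sort are made up for illustration. It assumes that NFKD
normalization is an acceptable stand-in for "canonicalize and remove
font/width", that dropping combining marks is an acceptable way to remove
diacritics, and that case folding is what "remove case info" means.

```python
import unicodedata


def general_sort_key(s: str) -> str:
    """Hypothetical key for a locale-independent 'general' sort."""
    # Compatibility decomposition: folds width/font variants (for example
    # fullwidth letters) and splits accented letters into a base letter
    # followed by combining marks.
    decomposed = unicodedata.normalize("NFKD", s)
    # Remove diacritics by dropping the combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove case distinctions.
    return stripped.casefold()


def general_sort(strings):
    """Sort by the folded key in code point (binary) order.

    The original string is used as a tie-breaker so that strings with
    equal keys still come out in a deterministic order.
    """
    return sorted(strings, key=lambda s: (general_sort_key(s), s))


if __name__ == "__main__":
    names = ["zebra", "\u00c5land", "apple", "Zoo"]
    print(sorted(names))        # plain binary (code point) sort
    print(general_sort(names))  # general sort sketched above
```

A plain binary sort puts "Zoo" first and the name starting with A WITH
RING ABOVE last, while the general sort interleaves them with the
unaccented lowercase names. It still satisfies no particular locale; it
only gives one predictable order for mixed content.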

Hong
