Re: Unicode sorting...

Bryan C. Warnock Fri, 08 Jun 2001 14:12:05 -0700
On Friday 08 June 2001 02:17 pm, NeonEdge wrote:
> > Another example is the chinese has no definite
> > sorting order, period. The commonly used scheme are
> > phonetic-based or stroke-based. Since many characters
> > have more than one pronunciation (context sensitive)
> > and more than one forms (simplified and traditional).
> > So if we have a mix content from china and taiwan, it
> > is impossible to sort in a way everyone will feel happy
>
> If this is the case, how would a regex like "^[a-zA-Z]" work (or other,
> more sensitive characters)? If just about anything can come between A and
> Z, and letters that might be there in a particular locale aren't in
> another locale, then how will regex engine make the distinction? Will it
> have to create its own locale-specific character table?
> Grant M.
> (is it just me, or is this looking more and more painful).

This is why I've been thinking that locales should be in bed with string 
manglers.  Sans locale, ranges would be defined strictly by the underlying 
representation, except perhaps for EBCDIC (for legacy reasons - recently 
hashed out on p5p).  Locales, however, would define how their own ranges 
work.
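
To make the distinction concrete, here is a rough Python sketch of the two
kinds of range membership.  Everything named here (in_codepoint_range,
make_locale_range, SPANISH_LOWER) is made up for illustration and is not an
existing or proposed API.  It only assumes that a locale can supply its
alphabet as an ordered string, with Spanish as the example because it
places n-tilde between n and o.

```python
def in_codepoint_range(ch, lo, hi):
    """No locale: membership is decided by raw code point values."""
    return ord(lo) <= ord(ch) <= ord(hi)


def make_locale_range(alphabet):
    """Build a range test from a locale's ordered alphabet (hypothetical helper)."""
    rank = {c: i for i, c in enumerate(alphabet)}

    def in_range(ch, lo, hi):
        # Characters the locale does not know about are never in its ranges.
        if ch not in rank or lo not in rank or hi not in rank:
            return False
        return rank[lo] <= rank[ch] <= rank[hi]

    return in_range


# Assumption for the example: lowercase Spanish alphabet, n-tilde after n.
SPANISH_LOWER = "abcdefghijklmn\u00f1opqrstuvwxyz"
in_spanish_range = make_locale_range(SPANISH_LOWER)

print(in_codepoint_range("\u00f1", "a", "z"))  # False: U+00F1 lies beyond 'z'
print(in_spanish_range("\u00f1", "a", "z"))    # True: the locale puts it after 'n'
```

A regex engine working this way would resolve [a-z] through whichever test
is in effect, rather than carrying one global character table.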

Locales could probably even reference other locales.  Farsi, for example, 
may define a sort order (and with it, a character range) to reconstitute its 
alphabet from the Arabic set, but define the English locale to handle 
everything it doesn't.  (So that Farsi would sort Farsi according to Farsi 
rules, embedded English text would be handled as if it were English text, 
and anything that English doesn't cover, would be handled per the standard 
Unicode rules.  That would have the added bonus of actually having English 
sorting, which we currently don't have.)
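
The chaining idea might look something like the following Python sketch.
The Locale class, its fallback parameter, and the tiny alphabets are all
assumptions for illustration.  In particular, the three-letter Farsi
alphabet is only a fragment (be, pe, te), "English" here simply interleaves
lower and upper case, and ordering the tiers (Farsi first, then English,
then raw code points) is one arbitrary choice among several.

```python
import string


class Locale:
    """Hypothetical locale: an ordered alphabet plus an optional fallback locale."""

    def __init__(self, name, alphabet, fallback=None):
        self.name = name
        self.rank = {ch: i for i, ch in enumerate(alphabet)}
        self.fallback = fallback

    def char_key(self, ch, depth=0):
        # Use this locale's own order if it covers the character.
        if ch in self.rank:
            return (depth, self.rank[ch])
        # Otherwise hand the character to the referenced locale.
        if self.fallback is not None:
            return self.fallback.char_key(ch, depth + 1)
        # Nothing in the chain covers it: fall back to the code point.
        return (depth + 1, ord(ch))

    def sort_key(self, text):
        return [self.char_key(ch) for ch in text]


# Assumed English order: aAbBcC..., so case does not split the alphabet.
ENGLISH = Locale("en", "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)))

# Fragment of the Farsi alphabet: be (U+0628), pe (U+067E), te (U+062A).
FARSI = Locale("fa", "\u0628\u067e\u062a", fallback=ENGLISH)

words = ["\u062a", "Banana", "\u067e", "apple"]
by_codepoint = sorted(words)
by_locale = sorted(words, key=FARSI.sort_key)

print(by_codepoint)  # ['Banana', 'apple', te, pe]: 'B' < 'a', and te < pe
print(by_locale)     # [pe, te, 'apple', 'Banana']: pe before te, 'a' before 'B'
```

Note that pe sorts after te by code point, because it was added to the
Arabic block later, which is exactly the kind of thing a Farsi locale would
have to put right.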

You're still generally going to be limited to character sorting, so no 
dictionary lookups to determine what may be dictionary order in a native 
language.  (If such a thing is defined.  Arabic has two, a native and a 
western ordering of the language.  The native is more straightforward, 
although I believe collation and sorting of words is done without the 
grammatical prefixes and suffixes.  The western would probably require 
dictionary lookups, which, in turn, would probably require true language 
parsing.)

Yes, this has tremendous potential (for pain *and* pleasure), but the idea 
should be to foist the burden only on the folks that want it.  (And since I 
directly support text mungers, I'm one of them.)

-- 
Bryan C. Warnock
[EMAIL PROTECTED]