At 01:31 PM 9/21/2001 +0100, Donal K. Fellows wrote:
>Dan Sugalski <[EMAIL PROTECTED]> wrote:
> > I've also been told that the problem even exists in Western European
> > languages--some languages consider accented (or umlauted, or tilde'd, or
> > whatever) characters different from the un-accented version, and some
> > don't. And in some cases two different languages will sort the same mix of
> > accented and unaccented characters differently. (I can't pull an example
> > out of the air at the moment, so I might be wrong here. I'm not familiar
> > with the character sorting schemes for all the languages in Western
> > Europe, so I'm taking this on faith)
>
>It's better than that. AIUI, in (European) Spanish, the letter
>sequence 'ch' sorts between 'c' and 'd', whereas in American Spanish
>it is sorted as a 'c' followed by an 'h'.
Oh, joy. :) Getting a good set of text sorting routines is going to be so
much fun...
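To make the 'ch' case concrete, here's a rough sketch in Python (not anything we've settled on). It assumes the rule is exactly as Donal describes it, with 'ch' treated as one letter that sorts after every plain 'c' and before 'd'. The function name and the word list are made up for illustration, and a real collation routine would have to handle accents and case properly as well.

```python
def traditional_key(word):
    """Hypothetical sort key: 'ch' is one letter between 'c' and 'd'."""
    key = []
    w = word.lower()
    i = 0
    while i < len(w):
        if w.startswith("ch", i):
            # ("c", 1) compares greater than any plain ("c", 0)
            # and less than ("d", 0).
            key.append(("c", 1))
            i += 2
        else:
            key.append((w[i], 0))
            i += 1
    return key

# Made-up sample data.
words = ["cita", "chico", "dama", "cuna", "coche", "cocina"]

# Plain code point order: 'c' followed by 'h'.
plain = sorted(words)

# Same data, with 'ch' treated as a single letter.
traditional = sorted(words, key=traditional_key)

print(plain)
print(traditional)
```

Same code points, two different orderings, which is why the string has to carry something that tells the sort which rule applies.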
> > Anyway, there you go. To completely represent a string you need lots of
> > parts. The reference for the bits is:
> >
> > a series of code points: Gotta have the raw data
>
>Really? I suppose it depends on what you think of as the raw data.
>For text, it is typically the characters (not the encoding of them)
>that is the raw data, and for binary data it is the bytes (which
>should not be encoded) which are the raw data. :^)
Hey, binary data's data too, y'know! ;-P Granted, eight-bit fixed-width
data, but that's OK.
> > A character set: It helps to know what character 12 actually *is*
>
>Do you mean a charset here? They are (apparently) different to
>character sets (i18n terminology is terminally confusing, I know.)
Well, I meant "Is the data Unicode, Shift-JIS, ASCII, EBCDIC, raw
binary...." Character set's probably the wrong terminology, but I'm not
sure if there's a right one. (Or, rather, which of the Definitely Correct
ones should be used)
> > An encoding: So we can pick characters out of the raw data. (Helps to know
> > how big a character is...)
>
>This would record the difference between UCS16 and UTF8, for example?
Yes.
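A quick illustration in Python of why the encoding has to travel with the data. I'm assuming UTF-16 (big-endian) stands in for the fixed-ish 16-bit form mentioned above, and the sample string is made up. The same five code points come out as different byte sequences of different lengths.

```python
# Five code points; the third (U+00EF) is outside ASCII.
text = "na\u00efve"

utf8_bytes = text.encode("utf-8")       # 1 to 4 bytes per code point
utf16_bytes = text.encode("utf-16-be")  # 2 bytes each for these characters

print(len(text), len(utf8_bytes), len(utf16_bytes))

# Decoding with the right encoding recovers the original code points.
round_trip = utf8_bytes.decode("utf-8")
```

Without knowing which encoding was used there's no reliable way to say where one character ends and the next begins.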
> > A language: So we can properly interpret the data in those cases where we
> > care, generally for comparison and sorting
>
>Should be a locale, not just a language (as my example above shows.)
>There is another complication in that the correct collation order for
>strings might even be domain-specific (e.g. phone books.)
I avoided the locale word because it's so loaded. The language (or whatever
we call it) for a string wouldn't affect how numbers or dates are formatted.
As for domain-specific sorting, that's outside what this is worried about.
The default sort would use the info here, but special-purpose sorts would
presumably be, well, more special-purpose, in which case the programmer
writing the sort could do whatever they thought was best.
> > Length in Glyphs: This one's still up in the air (we might not do it), but
> > it's nice in those cases where multiple code points collapse into a single
> > glyph on-screen
>
>I'd advise staying away from that unless you want to get involved in
>the world-of-pain that is text-rendering in South and South-East Asian
>languages. (Arabic, Devanagari and Thai are good examples of how
>nasty things can get.)
I was thinking of this in terms of Unicode combining characters, where
multiple code points got glommed together into a single displayed
character. I'm all for avoiding pain, though.
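For the combining-character case, a minimal Python sketch of the difference between the two lengths. The helper name is hypothetical, and counting non-combining code points is only a rough proxy for what ends up on screen; it does nothing for the harder rendering cases Donal mentions.

```python
import unicodedata

def count_base_characters(s):
    """Hypothetical helper: count code points that are not combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one
# displayed character.
decomposed = "e\u0301"

code_points = len(decomposed)
displayed = count_base_characters(decomposed)

# Normalizing to NFC folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(code_points, displayed, len(composed))
```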
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk