At 01:31 PM 9/21/2001 +0100, Donal K. Fellows wrote:
>Dan Sugalski <[EMAIL PROTECTED]>
> > I've also been told that the problem even exists in Western European
> > languages--some languages consider accented (or umlauted, or tilde'd, or
> > whatever) characters different from the un-accented version, and some
> > don't. And in some cases two different languages will sort the same mix of
> > accented and unaccented characters differently. (I can't pull an example
> > out of the air at the moment, so I might be wrong here. I'm not familiar
> > with the character sorting schemes for all the languages in Western Europe,
> > so I'm taking this on faith)
>
>It's better than that.  AIUI, in (European) Spanish, the letter
>sequence 'ch' sorts between 'c' and 'd', whereas in American Spanish
>it is sorted as a 'c' followed by an 'h'.

Oh, joy. :) Getting a good set of text sorting routines is going to be so 
much fun...
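
To make the 'ch' case concrete, here is a minimal Python sketch of a sort key that treats 'ch' as one letter sorting between 'c' and 'd'. The function name and the word list are made up for illustration, and the trick of mapping 'ch' to 'c' plus a very high code point is just one convenient way to get the ordering. It is not a real collation implementation.

```python
def traditional_spanish_key(word):
    """Hypothetical sort key: 'ch' counts as one letter between 'c' and 'd'."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            # "c\uffff" compares greater than plain "c" and less than "d".
            key.append("c\uffff")
            i += 2
        else:
            key.append(lowered[i])
            i += 1
    return key


words = ["dama", "chico", "cuna", "casa"]

# Plain code point order: 'ch' is just a 'c' followed by an 'h'.
by_code_point = sorted(words)

# 'ch' as its own letter: every other c-word comes before any ch-word.
by_traditional = sorted(words, key=traditional_spanish_key)

print(by_code_point)
print(by_traditional)
```

The same four words come out in two different orders depending on which convention the sort is told to use, which is the whole problem in miniature.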

> > Anyway, there you go. To completely represent a string you need lots of
> > parts. The reference for the bits is:
> >
> > a series of code points: Gotta have the raw data
>
>Really?  I suppose it depends on what you think of as the raw data.
>For text, it is typically the characters (not the encoding of them)
>that is the raw data, and for binary data it is the bytes (which
>should not be encoded) which are the raw data.  :^)

Hey, binary data's data too, y'know! ;-P Granted, eight-bit fixed-width 
data, but that's OK.

> > A character set: It helps to know what character 12 actually *is*
>
>Do you mean a charset here?  They are (apparently) different to
>character sets (i18n terminology is terminally confusing, I know.)

Well, I meant "Is the data Unicode, Shift-JIS, ASCII, EBCDIC, raw 
binary...." Character set's probably the wrong terminology, but I'm not 
sure if there's a right one. (Or, rather, which of the Definitely Correct 
ones should be used)

> > An encoding: So we can pick characters out of the raw data. (Helps to know
> > how big a character is...)
>
>This would record the difference between UCS16 and UTF8, for example?

Yes.
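
As a small illustration of why the encoding has to travel with the data, the Python sketch below encodes the same five code points two ways. The sample word is arbitrary, and Python's "utf-16-be" codec is used here only as a stand-in for a 16-bit encoding.

```python
text = "na\u00efve"  # five code points; U+00EF is 'i' with diaeresis

utf8_bytes = text.encode("utf-8")       # variable width: 1 or 2 bytes each here
utf16_bytes = text.encode("utf-16-be")  # 2 bytes per code point for this text

print(len(text), len(utf8_bytes), len(utf16_bytes))

# Guessing the wrong encoding means the bytes cannot be split into characters.
try:
    utf16_bytes.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(wrong_guess_failed)
```

The code points are identical in both cases; only the byte layout differs, so the byte length and the rules for finding character boundaries depend entirely on the encoding tag.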

> > A language: So we can properly interpret the data in those cases where we
> > care, generally for comparison and sorting
>
>Should be a locale, not just a language (as my example above shows.)
>There is another complication in that the correct collation order for
>strings might even be domain-specific (e.g. phone books.)

I avoided the locale word because it's so loaded. The language (or whatever 
we call it) for a string wouldn't affect how numbers or dates are formatted.

As for domain-specific sorting, that's outside what this is worried about. 
The default sort would use the info here, but special-purpose sorts would 
presumably be, well, more special-purpose, in which case the programmer 
writing the sort could do whatever they thought was best.
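
A rough Python sketch of that split, under loud assumptions: the TaggedString class, its field names, the language tags, and the DEFAULT_KEYS table are all hypothetical and only illustrate the idea that the default sort consults the string's language while a special-purpose sort supplies its own key.

```python
from dataclasses import dataclass


@dataclass
class TaggedString:
    """Hypothetical record carrying the parts discussed so far."""
    code_points: list  # the raw data, as integers
    charset: str       # e.g. "unicode"
    encoding: str      # e.g. "utf-8"
    language: str      # e.g. "en" (made-up tag)

    def text(self):
        return "".join(chr(cp) for cp in self.code_points)


# Hypothetical table of per-language default sort keys.
DEFAULT_KEYS = {
    "en": lambda s: s.text().lower(),
}


def default_sort(strings):
    """Sort using each string's language; fall back to code point order."""
    def key(s):
        key_fn = DEFAULT_KEYS.get(s.language, lambda t: t.text())
        return key_fn(s)
    return sorted(strings, key=key)


def make(word, language="en"):
    return TaggedString([ord(c) for c in word], "unicode", "utf-8", language)


items = [make("banana"), make("Apple"), make("cherry")]

default_order = [s.text() for s in default_sort(items)]

# A special-purpose sort ignores the language and uses whatever key it likes.
by_last_letter = [s.text() for s in sorted(items, key=lambda s: s.text()[-1])]

print(default_order)
print(by_last_letter)
```

Nothing number- or date-related hangs off the language field here, which matches the reason for not calling it a locale.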

> > Length in Glyphs: This one's still up in the air (we might not do it), but
> > it's nice in those cases where multiple code points collapse into a single
> > glyph on-screen
>
>I'd advise staying away from that unless you want to get involved in
>the world-of-pain that is text-rendering in South and South-East Asian
>languages.  (Arabic, Devanagari and Thai are good examples of how
>nasty things can get.)

I was thinking of this in terms of Unicode combining characters, where 
multiple code points got glommed together into a single displayed 
character. I'm all for avoiding pain, though.
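
For the combining-character case specifically, a minimal Python sketch looks like the following. The count_combined helper is hypothetical and only skips code points that Unicode marks as combining; it is nowhere near full glyph or grapheme handling, and it would not cope with the rendering cases mentioned above.

```python
import unicodedata


def count_combined(text):
    """Count code points that are not combining marks (a rough approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(len(decomposed), count_combined(decomposed))
print(len(composed), count_combined(composed))
```

Both forms display as one character, but the code point count differs, which is the gap a separate "length in glyphs" field would be papering over.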


                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
