Dan Sugalski <[EMAIL PROTECTED]> wrote:
> I've also been told that the problem even exists in Western European
> languages--some languages consider accented (or umlauted, or tilde'd, or
> whatever) characters different from the un-accented version, and some
> don't. And in some cases two different languages will sort the same mix of
> accented and unaccented characters differently. (I can't pull an example
> out of the air at the moment, so I might be wrong here. I'm not familiar
> with the character sorting schemes for all the languages in Western Europe,
> so I'm taking this on faith)
It's better than that. AIUI, in (European) Spanish, the letter
sequence 'ch' sorts between 'c' and 'd', whereas in American Spanish
it is sorted as a 'c' followed by an 'h'.
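To make that concrete, here's a minimal sketch (in Python, purely illustrative) of how the traditional Spanish ordering, with 'ch' as a letter in its own right, disagrees with a plain code-point sort. The alphabet table and the word list are my own invented examples:

```python
# Traditional Spanish treats the digraphs "ch" and "ll" as single
# letters; "ch" sorts between "c" and "d".
TRADITIONAL = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j",
               "k", "l", "ll", "m", "n", "o", "p", "q", "r", "s", "t",
               "u", "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(TRADITIONAL)}

def collation_key(word):
    """Split a word into traditional-Spanish letters and rank each one."""
    key, i = [], 0
    while i < len(word):
        if word[i:i+2] in RANK:   # prefer the digraphs "ch" and "ll"
            key.append(RANK[word[i:i+2]])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

words = ["dato", "chico", "cosa"]
print(sorted(words))                     # code-point order: chico, cosa, dato
print(sorted(words, key=collation_key))  # traditional order: cosa, chico, dato
```

Same data, two defensible answers — which is exactly why the string itself can't carry enough information to sort correctly on its own.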
> Anyway, there you go. To completely represent a string you need lots of
> parts. The reference for the bits is:
>
> a series of code points: Gotta have the raw data
Really? I suppose it depends on what you think of as the raw data.
For text, it is typically the characters (not the encoding of them)
that is the raw data, and for binary data it is the bytes (which
should not be encoded) which are the raw data. :^)
> A character set: It helps to know what character 12 actually *is*
Do you mean a charset here? They are (apparently) different to
character sets (i18n terminology is terminally confusing, I know.)
> An encoding: So we can pick characters out of the raw data. (Helps to know
> how big a character is...)
This would record the difference between UCS-2 and UTF-8, for example?
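I.e. the same sequence of characters comes out as quite different byte sequences depending on which encoding you picked (a quick Python illustration; the sample string is arbitrary):

```python
# One string, two encodings: identical code points, different bytes.
s = "café"
utf8 = s.encode("utf-8")       # variable-width: 'é' takes two bytes
utf16 = s.encode("utf-16-be")  # two bytes per code point (in the BMP)
print(len(s), len(utf8), len(utf16))  # 4 code points; 5 vs 8 bytes
```

Without knowing which of those was used, you can't pick the characters back out of the buffer.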
> A language: So we can properly interpret the data in those cases where we
> care, generally for comparison and sorting
Should be a locale, not just a language (as my example above shows.)
There is another complication in that the correct collation order for
strings might even be domain-specific (e.g. phone books.)
> Length in bytes: So we know how much raw data we have
>
> Length in code points: Because knowing this is nice too
You definitely need these two. Especially if you have any
variable-width encodings lurking around...
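For instance, with UTF-8 the two lengths come apart, and finding the n-th character degenerates into a scan from the start. A rough sketch (the helper function is mine, not anything from a real implementation):

```python
# "naïve" is 5 code points but 6 bytes in UTF-8.
data = "naïve".encode("utf-8")

def nth_char_offset(buf, n):
    """Byte offset of the n-th code point in UTF-8: a linear scan,
    stepping over each lead byte by its encoded width."""
    offset = 0
    for _ in range(n):
        first = buf[offset]
        if first < 0x80:
            offset += 1        # ASCII, one byte
        elif first < 0xE0:
            offset += 2        # two-byte sequence
        elif first < 0xF0:
            offset += 3        # three-byte sequence
        else:
            offset += 4        # four-byte sequence
    return offset

print(len(data), len(data.decode("utf-8")))  # 6 bytes vs 5 code points
print(nth_char_offset(data, 3))              # byte 4: the 'ï' took two bytes
```

Caching both lengths saves you repeating that scan every time someone asks.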
> Length in Glyphs: This one's still up in the air (we might not do it), but
> it's nice in those cases where multiple code points collapse into a single
> glyph on-screen
I'd advise staying away from that unless you want to get involved in
the world-of-pain that is text-rendering in South and South-East Asian
languages. (Arabic, Devanagari and Thai are good examples of how
nasty things can get.)
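Even before you get to rendering, the code-point/glyph mismatch shows up with plain combining characters — here's the simplest case, using Python's unicodedata as the illustration:

```python
import unicodedata

# One on-screen glyph, two code points: 'e' plus a combining acute accent.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 code points vs 1
print(composed == "\u00e9")            # both are the same 'é'
```

And that's the easy end of the problem; the Asian scripts add reordering and context-dependent shaping on top.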
> For those folks who've made it this far and are starting (or continuing) to
> froth over efficiency, I'll point out that for most of the string work that
> the interpreter (any interpreter) needs to do it can, if things match (and
> we'll make sure they do), treat the character data as a stream of n-byte
> characters. When doing an exact string match with the regex engine, for
> example, it doesn't really care what a character means as long as it's the
> same. And sets of characters (word, digit, whitespace, whatever) are just
> sets of characters--as long as it's got the set that matches the encoding
> of the RE and the string to be searched, it's happy. They're all just a
> bunch of bits after all.
For much string processing, all that really matters is having the
characters so that each is represented in a fixed-width encoding (Tcl
8.1 used UTF-8 only internally, and its performance stank; that version
is unsupported these days[*]) and there are no bizarre shift states
floating around (because you don't want to have to pre-multiply your
RE engines by the number of different shift states that some deranged
fool has put in their encoding.)
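Put another way: once needle and haystack agree on the encoding, an exact match really is just comparing code units, no interpretation needed — but let the encodings drift apart and the "just bits" shortcut silently fails. A toy demonstration (sample strings invented):

```python
# Exact matching works on raw bytes, provided both sides use the
# same encoding.
haystack = "søren was here".encode("utf-8")
needle = "søren".encode("utf-8")
print(needle in haystack)  # the byte sequences line up

# Encode the needle differently and the match quietly disappears.
print("søren".encode("utf-16-be") in haystack)  # no UTF-16 bytes in there
```

Which is why the interpreter has to make sure the encodings match before handing things off to the "dumb" fast path.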
Donal.
[* Anyone says they're using it, we instruct them to upgrade. ]
--
Donal K. Fellows, Department of Computer Science, University of Manchester, UK.
(work) [EMAIL PROTECTED] Tel: +44-161-275-6137 (preferred email addr.)
(home) [EMAIL PROTECTED] Tel: +44-1274-401017 Mobile: +44-7957-298955
http://www.cs.man.ac.uk/~fellowsd/ (Don't quote my .sig; I've seen it before!)