Since this has been a sore spot lately, and one we need to deal with, we might as well formally define what that is.

We must be able to:

*) Load in string data from an IO source, regardless of its encoding, and treat it as Unicode string data (see the sketch after this list)
*) Write string data to an IO source in any Unicode encoding
*) Collate strings per the Unicode standard
*) Convert non-Unicode string data to Unicode properly (that is, obeying the Unicode conversion rules)
*) Treat combining characters the same regardless of whether they're composed or decomposed
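
To make that first item concrete, here's a rough sketch of what loading from an IO source without knowing the encoding up front might look like. I'm using ICU's C API purely as illustration -- the calls, buffer sizes, and the fall-back-to-UTF-8 policy are all mine, not settled API:

    #include <stdio.h>
    #include <unicode/ucnv.h>

    int main(void) {
        /* UTF-16LE bytes for "hi", BOM included -- stands in for IO input */
        const char bytes[] = "\xFF\xFE" "h\0i\0";
        int32_t nbytes = (int32_t)(sizeof bytes - 1);

        UErrorCode status = U_ZERO_ERROR;
        int32_t siglen = 0;

        /* Sniff a Unicode signature (BOM) to pick the converter */
        const char *name = ucnv_detectUnicodeSignature(bytes, nbytes,
                                                       &siglen, &status);
        if (name == NULL)
            name = "UTF-8";    /* no BOM, so assume UTF-8 */

        UConverter *conv = ucnv_open(name, &status);
        UChar buf[64];
        int32_t len = ucnv_toUChars(conv, buf, 64, bytes + siglen,
                                    nbytes - siglen, &status);
        printf("decoded %d UTF-16 code units from %s input\n",
               (int)len, name);
        ucnv_close(conv);
        return U_FAILURE(status) ? 1 : 0;
    }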


We don't care about on-screen rendering or date/time/money formatting.

So, basically, we need to be able to read in data regardless of whether it's UTF-8, UTF-16, or UTF-32 encoded, and once we have it we should be able to properly match "o" against "o" and not "ö" (that's o with an umlaut over it), regardless of whether the "ö" is composed (that is, one code point) or decomposed (that is, two code points), and then write it out to some IO handle in proper UTF-8/16/32 format. When comparing two Unicode strings we must be able to do so properly, per the Unicode collation algorithm. (With potential locale overrides, if we ever put those in.) We must also be able to case-mangle (that is, upcase, downcase, or titlecase) the string.
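
Here's roughly what the matching, collation, and case-mangling bits look like if you lean on ICU -- again just a sketch to show the shape of the thing, with made-up buffer sizes, not proposed API:

    #include <stdio.h>
    #include <unicode/unorm2.h>
    #include <unicode/ucol.h>
    #include <unicode/ustring.h>

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;

        UChar composed[]   = { 0x00F6, 0 };          /* o-umlaut, one code point */
        UChar decomposed[] = { 0x006F, 0x0308, 0 };  /* o + combining diaeresis  */
        UChar plain_o[]    = { 0x006F, 0 };

        /* Normalize both spellings to NFC, then compare code units */
        const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
        UChar a[8], b[8];
        int32_t alen = unorm2_normalize(nfc, composed,   -1, a, 8, &status);
        int32_t blen = unorm2_normalize(nfc, decomposed, -1, b, 8, &status);
        printf("same after NFC? %s\n",
               u_strCompare(a, alen, b, blen, 0) == 0 ? "yes" : "no");

        /* Collate per the UCA; a locale-tailored collator could
           replace the root one if we ever do overrides */
        UCollator *coll = ucol_open("", &status);
        printf("o vs o-umlaut: %d\n",
               (int)ucol_strcoll(coll, plain_o, -1, a, alen));

        /* Case-mangling: upcase with the default locale's rules */
        UChar up[8];
        u_strToUpper(up, 8, a, alen, NULL, &status);

        ucol_close(coll);
        return U_FAILURE(status) ? 1 : 0;
    }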

Additionally, if we have source text that's Latin-n, EBCDIC, ASCII, or whatever, we must be able to convert it to Unicode with no loss. (Which I believe is now doable with Unicode 4.0.) Losslessly converting Unicode to ASCII/EBCDIC/whatever is *not* required, which is fine, as it's theoretically (and often practically) impossible.
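
Same caveat as before: a quick ICU-flavored sketch of the legacy-to-Unicode direction. The converter names ("ISO-8859-1", and "ibm-37" for US EBCDIC) and the sample bytes are just for illustration:

    #include <stdio.h>
    #include <unicode/ucnv.h>

    /* Convert a legacy-charset buffer to UTF-16 via a named converter */
    static int32_t to_unicode(const char *charset, const char *src,
                              int32_t srclen, UChar *dest, int32_t cap) {
        UErrorCode status = U_ZERO_ERROR;
        UConverter *conv = ucnv_open(charset, &status);
        if (U_FAILURE(status))
            return -1;
        int32_t len = ucnv_toUChars(conv, dest, cap, src, srclen, &status);
        ucnv_close(conv);
        return U_FAILURE(status) ? -1 : len;
    }

    int main(void) {
        UChar out[16];

        /* Latin-1: 0xF6 is o-umlaut */
        const char latin1[] = { 'f', (char)0xF6, 'r' };
        int32_t len = to_unicode("ISO-8859-1", latin1, 3, out, 16);
        printf("Latin-1 -> %d UTF-16 units\n", (int)len);

        /* EBCDIC code page 37: 0xC8 0xC9 is "HI" */
        const char ebcdic[] = { (char)0xC8, (char)0xC9 };
        len = to_unicode("ibm-37", ebcdic, 2, out, 16);
        printf("EBCDIC  -> %d UTF-16 units\n", (int)len);
        return 0;
    }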

I think that's it. Spelling it out's made the encoding and charset API clear. I'll type that in and get it off next.
--
Dan


--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
