> * Convert from and to UTF-32
> * lengths in bytes, characters, and possibly glyphs
> * character size (with the variable length ones reporting in negative numbers)

What do you mean by character size if it does not support variable-length encodings?

> * get and set the locale (This might not be the spot for this)

The locale should be context-based. Each thread should have its own
locale.

> * normalize (a noop for non-Unicode data)
> * Get the encoding name

The encoding name is tricky. Neither Java nor POSIX defines its
naming scheme. I personally prefer the full name in lower case,
such as "iso8859-1", with the API converting names to lower case
automatically. The encoding name must be strict ASCII. Some common
aliases may be provided. There must be an API to list all supported
encodings at runtime.

> * Do a substr operation by character and glyph

A byte-based substr is more useful. If I have UTF-8 and want to
substr it into another UTF-8 string, it is painful to convert it or
do a linear search for the character position.

> I don't know if we want to treat encoding and data format separately--it 
> would seem to make sense to be able to have a string tell us it's 
> Unicode/UTF-32/Korean rather than just UTF-32/Korean, since I 
> don't see why it wouldn't be allowable to use the UTF-8 or UTF-16 encoding
> on non-Unicode data. (Not that it'd necessarily be all that useful, and I
> can see just not allowing it)

I don't see why the core should support language/locale in this detail.
I deal with a lot of mixed Chinese/English text files. There is no way
to represent that with a plain string, unless you want to make the
string a rich-format text buffer. The current locale or an explicit
locale parameter will suffice for your goal.

Hong
