> On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
> > Personally I like the UTF-8 encoding. The solution to the
> > variable length can be handled by a special (virtual)
> > function like
> 
> I'm expecting that the virtual, internal representation will not
> be in a UTF but will simply be an array of codepoints. Manipulating
> UTF8 internally is horrible because it's a variable length encoding,
> so you need to keep track of where you are both in terms of characters
> and bytes. Yuck, yuck, yuck.

I am not sure if you have read through my email.

The concept of a character has nothing to do with codepoints.
Many characters are composed of more than one codepoint.
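
To make that concrete, here is a small sketch in Python (chosen only
for illustration; it says nothing about how Perl stores strings). The
helper name describe() is made up for this example. It counts
codepoints and UTF-8 bytes for text that a reader would call a single
character.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT (U+0301): one visible character.
decomposed = "e\u0301"
# The same character as a single precomposed codepoint (U+00E9).
precomposed = unicodedata.normalize("NFC", decomposed)

# One Hangul syllable written as three conjoining jamo.
jamo = "\u1112\u1161\u11ab"
# NFC composes the jamo into one syllable codepoint (U+D55C).
syllable = unicodedata.normalize("NFC", jamo)

def describe(s):
    """Return (codepoint count, UTF-8 byte count) for a string."""
    return (len(s), len(s.encode("utf-8")))

for label, s in [("decomposed", decomposed), ("precomposed", precomposed),
                 ("jamo", jamo), ("syllable", syllable)]:
    print(label, describe(s))
```

In every case the reader sees one character, yet the codepoint count
varies from 1 to 3, so a codepoint array alone does not give you
character positions.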

The concept of character position is completely useless in
many languages. Many languages just don't have the English-style
"character"; see collation, Hangul conjoining jamo, and combining
characters. There is just no easy way to keep track of character
position. What you really meant was probably the codepoint position.
The codepoint position is largely internal to the library.
As long as regular expressions can handle UTF-8 efficiently
(as they do now), most people will feel just fine with it.
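
As a rough illustration of why the position is a library detail, here
is a Python sketch (again only an illustration, not Perl's regex
engine; the sample text and variable names are made up). The user
writes a match and a substitution and never touches an offset. The
codepoint offset and the UTF-8 byte offset of the match differ, but
only code below the matching interface would ever need either one.

```python
import re

# Sample text with two non-ASCII codepoints (U+00EF and U+00E9).
text = "na\u00efve caf\u00e9"

# The user-level operations: a match and a substitution.
m = re.search(r"caf.", text)
replaced = re.sub(r"caf.", "tea", text)

# Offsets that only the library needs to care about.
cp_start = m.start()                                # offset in codepoints
byte_start = len(text[:cp_start].encode("utf-8"))   # offset in UTF-8 bytes

print(m.group(), replaced, cp_start, byte_start)
```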

There are just not many people interested in the codepoint
position, if they have ever heard of it. They care more about
m// or s///.

Even if you want to keep track of character offsets, it is still much
easier than many of the other Unicode features I mentioned.

Hong
