Re: string encoding

Jarkko Hietaniemi Thu, 15 Feb 2001 15:40:41 -0800
On Thu, Feb 15, 2001 at 11:16:29PM +0000, Simon Cozens wrote:
> On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
> > Personally I like the UTF-8 encoding. The solution to the
> > variable length can be handled by a special (virtual)
> > function like
> 
> I'm expecting that the virtual, internal representation will not
> be in a UTF but will simply be an array of codepoints. Manipulating
> UTF8 internally is horrible because it's a variable length encoding,
> so you need to keep track of where you are both in terms of characters
> and bytes. Yuck, yuck, yuck.

...and because of this you can't randomly access the string, you are
reduced to sequential access (*).  And here I thought we could have
left tape drives to the last millennium.
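
To make the sequential-access point concrete, here is a minimal Python
sketch (the function name utf8_char_offset is made up for illustration,
and the input is assumed to be valid UTF-8). Finding the byte offset of
the Nth character means walking every byte before it:

```python
def utf8_char_offset(data: bytes, index: int) -> int:
    """Return the byte offset at which character number `index` starts.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """
    if index < 0:
        raise IndexError("negative character index")
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new character.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


data = "na\u00efve caf\u00e9".encode("utf-8")
print(utf8_char_offset(data, 3))  # byte offset of 'v'
```

With an array of codepoints the same lookup is a single index
operation; here its cost grows with the position you ask for.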

(*) Yes, of course you could cache your sequential access so you only
need to do it once, and build balanced trees and whatnot out of those
offsets to have random access emulated in O(lg n) per lookup, but as
soon as you update the string, you have to update the tree, or whatever
data structure you chose.  Pain, pain, pain.
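
A rough sketch of that caching idea, under stated assumptions: it uses
a flat list of character start offsets rather than a balanced tree, the
class name IndexedUtf8 and its methods are hypothetical, and the input
is assumed to be valid UTF-8. Lookups become cheap, but every update
has to repair the index:

```python
class IndexedUtf8:
    """UTF-8 buffer plus a cached list of character start offsets.

    Hypothetical illustration: a flat list stands in for the tree.
    """

    def __init__(self, text: str) -> None:
        self.data = bytearray(text.encode("utf-8"))
        self._rebuild()

    def _rebuild(self) -> None:
        # One sequential pass: record where each character starts.
        self.starts = [i for i, b in enumerate(self.data) if b & 0xC0 != 0x80]

    def __len__(self) -> int:
        return len(self.starts)

    def _span(self, index: int) -> tuple[int, int]:
        if not 0 <= index < len(self.starts):
            raise IndexError("character index out of range")
        start = self.starts[index]
        if index + 1 < len(self.starts):
            end = self.starts[index + 1]
        else:
            end = len(self.data)
        return start, end

    def char_at(self, index: int) -> str:
        # Random access through the cached offsets, no scan needed.
        start, end = self._span(index)
        return self.data[start:end].decode("utf-8")

    def set_char(self, index: int, char: str) -> None:
        start, end = self._span(index)
        self.data[start:end] = char.encode("utf-8")
        # The replacement may differ in byte length, shifting every
        # later offset, so this sketch rebuilds the whole cache.
        self._rebuild()


s = IndexedUtf8("caf\u00e9!")
s.set_char(1, "\u20ac")  # 1-byte 'a' replaced by 3-byte euro sign
print(s.char_at(4))      # still '!', but only because the index was rebuilt
```

A tree would make the repair cheaper than this full rebuild, but the
bookkeeping on every update does not go away.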

> -- 
> Calm down, it's *only* ones and zeroes.

I wish more people would keep this in mind.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen