> ...and because of this you can't randomly access the string, you are
> reduced to sequential access (*). And here I thought we could have
> left tape drives to the last millennium.
>
> (*) Yes, of course you could cache your sequential access so you only
> need to do it once, and build balanced trees and whatnot out of those
> offsets to have random access emulated in O(n lg n), but as soon as
> you update the string, you have to update the tree, or whatever data
> structure you chose. Pain, pain, pain.

People in Japan/China/Korea have been using multi-byte encodings for
a long time. I personally have used them for more than 10 years, and
I have never felt much of the "pain". Do you think I am using my
computer in O(n) while you are using yours in O(1)? There are
100 million people using variable-length encodings!

Take this example: in Chinese, every character has the same width, so
it is very easy to format paragraphs and lines. Most English web pages
are rendered in "Times New Roman", which is a variable-width font.
Do you think English pages are rendered in O(n) while Chinese pages
are rendered in O(1)?

As I said, there are many problems much harder than UTF-8. If you
want to support i18n and l10n, you have to live with it. If not, just
forget about the whole thing.

Hong