On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev
wrote:
> You don't need to do that to slice a string. I think you mean
> to say that you need to decode each character if you want to
> slice the string at the N-th code point? But this is exactly
> what I'm trying to point out: how would you find this N? How
> would you know if it makes sense, taking into account combining
> characters, and all the other complexities of Unicode?
Slicing a string implies finding the N-th code point: what other
way would you slice and have it make any sense? Finding the N-th
code point is much simpler with a constant-width encoding.
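To make the cost difference concrete, here is a small sketch (the function name is mine, purely illustrative): slicing UTF-8 at code-point indices requires a linear scan to find where each code point begins, whereas a constant-width encoding reduces the same slice to plain index arithmetic.

```python
def slice_utf8_codepoints(data: bytes, start: int, end: int) -> bytes:
    """Slice a UTF-8 byte string at code-point indices [start, end).

    Requires a linear scan: every byte must be inspected to find
    which bytes begin a code point (continuation bytes match 10xxxxxx).
    """
    offsets = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    offsets.append(len(data))
    return data[offsets[start]:offsets[end]]

s = "héllo wörld".encode("utf-8")
assert slice_utf8_codepoints(s, 1, 4) == "éll".encode("utf-8")
```

With a fixed-width encoding of width `w`, the same slice would simply be `data[start*w : end*w]`, O(1) instead of O(n).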
I left combining characters and the other intrinsic language
complexities baked into Unicode out of my previous analysis, but
if you want to bring those in, that's actually an argument in
favor of my encoding. With my encoding, you know up front whether
you're dealing with languages that have such complexity (just
check the header), whereas with a chunk of random UTF-8 text you
can never know that unless you decode the entire string and
extract knowledge of all the languages embedded in it.
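To illustrate the UTF-8 side of that claim (the helper below is my own sketch, not any standard API): discovering which scripts a UTF-8 string contains requires decoding every code point, since nothing in the byte stream announces it up front.

```python
import unicodedata

def scripts_in_utf8(data: bytes) -> set:
    """Learn which scripts a UTF-8 string contains.

    There is no shortcut: the entire string must be decoded,
    one code point at a time, to classify each character.
    """
    scripts = set()
    for ch in data.decode("utf-8"):  # full O(n) decode
        # Crude proxy for script: first word of the character name.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

assert scripts_in_utf8("aé".encode("utf-8")) == {"LATIN"}
assert "CJK" in scripts_in_utf8("a你".encode("utf-8"))
```

Under the proposed header scheme, the equivalent information would be a single header lookup instead of this full decode.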
For another similar example, let's say you want to run toUpper on
a multi-language string, which contains English in the first half
and some Asian script that doesn't define uppercase in the second
half. With my format, toUpper can check the header, then process
the English half and skip the Asian half (I'm assuming that the
substring indices for each language would be stored in this more
complex header). With UTF-8, you have to process the entire
string, because you never know what random languages might be
packed in there.
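A rough sketch of that toUpper scenario, assuming the hypothetical header described above (the `Segment` layout and `HAS_CASE` set are my inventions for illustration, not part of any defined format):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One header entry: a script tag plus the substring it covers."""
    script: str  # e.g. "Latin", "Han"
    text: str

# Scripts that actually define letter case (illustrative subset).
HAS_CASE = {"Latin", "Greek", "Cyrillic"}

def to_upper(segments):
    out = []
    for seg in segments:
        if seg.script in HAS_CASE:
            out.append(seg.text.upper())  # only caseful scripts are processed
        else:
            out.append(seg.text)          # skipped wholesale via the header
    return "".join(out)

mixed = [Segment("Latin", "hello "), Segment("Han", "你好")]
assert to_upper(mixed) == "HELLO 你好"
```

The point of the sketch: the Han segment is never examined character by character, whereas a UTF-8 toUpper must walk every code point of the whole string.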
UTF-8 is riddled with such performance bottlenecks, all to make
it self-synchronizing. But is anybody really using its less
compact encoding to do some "self-synchronized" integrity
checking? I suspect almost nobody is.
> If you want to split a string by ASCII whitespace (newlines,
> tabs and spaces), it makes no difference whether the string is
> in ASCII or UTF-8 - the code will behave correctly in either
> case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode
while splitting, when compared to a single-byte encoding.
>> You cannot honestly look at those multiple state diagrams and
>> tell me it's "simple."
> I meant that it's simple to implement (and adapt/port to other
> languages). I would say that UTF-8 is quite cleverly designed,
> so I wouldn't say it's simple by itself.
Perhaps decoding is not so bad for the type of people who write
the fundamental UTF-8 libraries. But implementation does not
refer merely to the UTF-8 libraries themselves, but also to all
the code that tries to build on them for internationalized apps.
And with all the unnecessary additional complexity added by
UTF-8, wrapping the average programmer's head around this mess
likely leads to as many problems as broken code-page
implementations did back in the day. ;)