On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev wrote:
> You don't need to do that to slice a string. I think you mean to say that you need to decode each character if you want to slice the string at the N-th code point? But this is exactly what I'm trying to point out: how would you find this N? How would you know if it makes sense, taking into account combining characters, and all the other complexities of Unicode?
Slicing a string implies finding the N-th code point; what other way would you slice it and have the result make any sense? And finding the N-th code point is much simpler with a constant-width encoding.
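
To put the cost difference in code (a rough D sketch; nthCodePointOffset is just an illustrative name, not anything from Phobos):

import std.utf : stride;

// Byte offset of the N-th code point in UTF-8: a linear scan, because
// each code point occupies anywhere from 1 to 4 bytes.
size_t nthCodePointOffset(string s, size_t n)
{
    size_t offset = 0;
    foreach (_; 0 .. n)
        offset += stride(s, offset); // stride() inspects the lead byte
    return offset;
}

// With a constant-width encoding (dchar[] here, i.e. UTF-32), the N-th
// code point is simply the N-th array element: no scan at all.
dchar nthCodePoint(const(dchar)[] s, size_t n)
{
    return s[n];
}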

I left combining characters and the other intrinsic language complexities baked into Unicode aside in my previous analysis, but if you want to bring those in, they are actually an argument in favor of my encoding. With my encoding, you know up front whether you're dealing with languages that have such complexity: just check the header. With a chunk of random UTF-8 text, you can never know that unless you decode the entire string once and extract knowledge of all the embedded languages.
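
Something like this is the header I have in mind (purely hypothetical names and layout, only to illustrate the idea):

// One entry per language run in the string.
struct LangRun
{
    ushort langId;      // language/script identifier
    size_t begin, end;  // substring indices covered by this run
}

struct LangHeader
{
    LangRun[] runs;
}

// Stub standing in for a small per-script property table.
bool scriptHasCombining(ushort langId) { return false; }

// Everything about the string's scripts is known in O(runs.length),
// without decoding a single character of the payload.
bool usesCombiningCharacters(in LangHeader h)
{
    foreach (run; h.runs)
        if (scriptHasCombining(run.langId))
            return true;
    return false;
}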

For another, similar example, say you want to run toUpper on a multi-language string that contains English in the first half and, in the second half, some Asian script that doesn't define uppercase. With my format, toUpper can check the header, process the English half, and skip the Asian half entirely (I'm assuming the substring indices for each language would be stored in this more complex header). With UTF-8, you have to process the entire string, because you never know what languages might be packed in there.
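
Sketched against the hypothetical LangHeader above (hasUppercase and toUpperAscii are again stand-ins for real case-mapping tables):

// Stubs for illustration only.
bool hasUppercase(ushort langId) { return langId == 0; } // say 0 = English
ubyte toUpperAscii(ubyte b) { return ('a' <= b && b <= 'z') ? cast(ubyte)(b - 32) : b; }

// toUpper that consults the header: runs whose script defines no
// uppercase are skipped without ever touching their bytes.
ubyte[] toUpperByHeader(in LangHeader h, ubyte[] payload)
{
    foreach (run; h.runs)
    {
        if (!hasUppercase(run.langId)) // e.g. the Asian half: skipped outright
            continue;
        foreach (ref b; payload[run.begin .. run.end])
            b = toUpperAscii(b);       // single-byte case mapping
    }
    return payload;
}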

UTF-8 is riddled with such performance bottlenecks, all to make it self-synchronizing. But is anybody really using its less compact encoding to do some "self-synchronized" integrity checking? I suspect almost nobody is.
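
For the record, the property being paid for is just this: continuation bytes always look like 10xxxxxx, so a reader dropped at an arbitrary (or corrupted) byte position can find the next code point boundary by skipping them. A sketch:

// Advance i to the start of the next code point by skipping any bytes
// of the form 10xxxxxx, i.e. (b & 0xC0) == 0x80.
size_t resync(const(ubyte)[] s, size_t i)
{
    while (i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
    return i;
}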

> If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless.
Except that splitting will still take longer with a variable-width encoding than with a single-byte encoding, because for non-ASCII text the same content occupies more bytes to scan.
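
For reference, here is roughly what the split being discussed looks like; note that it never decodes, it only compares raw bytes, so the remaining cost is purely how many bytes there are to scan:

// Split on ASCII whitespace over raw bytes. This is valid because UTF-8
// never reuses byte values below 0x80 inside a multi-byte sequence.
string[] splitAsciiWhitespace(string s)
{
    string[] parts;
    size_t start = 0;
    foreach (i; 0 .. s.length)
    {
        immutable char b = s[i];
        if (b == ' ' || b == '\t' || b == '\n' || b == '\r')
        {
            if (i > start)
                parts ~= s[start .. i];
            start = i + 1;
        }
    }
    if (start < s.length)
        parts ~= s[start .. $];
    return parts;
}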

>> You cannot honestly look at those multiple state diagrams and tell me it's "simple."
>
> I meant that it's simple to implement (and adapt/port to other languages). I would say that UTF-8 is quite cleverly designed, so I wouldn't say it's simple by itself.
Perhaps decoding is not so bad for the type of people who write the fundamental UTF-8 libraries. But "implementation" does not refer merely to those libraries; it also covers all the code that builds on them for internationalized apps. With all the unnecessary additional complexity UTF-8 brings, wrapping the average programmer's head around this mess likely leads to as many problems as broken code-page implementations did back in the day. ;)
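
To illustrate what that average programmer is up against, here is the branching in a bare-bones decoder; a correct one must additionally reject overlong forms, stray continuation bytes, surrogates, and values above U+10FFFF:

// Decode one code point starting at s[i], advancing i. Validation is
// deliberately omitted; a real decoder needs several more branches.
dchar decodeOne(const(ubyte)[] s, ref size_t i)
{
    uint b = s[i++];
    if (b < 0x80)
        return cast(dchar) b;              // 0xxxxxxx: plain ASCII
    int extra = (b & 0xE0) == 0xC0 ? 1     // 110xxxxx: 2-byte sequence
              : (b & 0xF0) == 0xE0 ? 2     // 1110xxxx: 3-byte sequence
              : 3;                         // 11110xxx: 4-byte sequence
    uint c = b & (0x3F >> extra);          // payload bits of the lead byte
    foreach (_; 0 .. extra)
        c = (c << 6) | (s[i++] & 0x3F);    // fold in each 10xxxxxx byte
    return cast(dchar) c;
}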
