On Saturday, 25 May 2013 at 10:33:12 UTC, Vladimir Panteleev
wrote:
> You don't need to do that to slice a string. I think you mean
> to say that you need to decode each character if you want to
> slice the string at the N-th code point? But this is exactly
> what I'm trying to point out: how would you find this N? How
> would you know if it makes sense, taking into account combining
> characters, and all the other complexities of Unicode?
Slicing a string implies finding the N-th code point: what other
way would you slice and have it make any sense? Finding the N-th
code point is much simpler with a constant-width encoding.
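To make the cost difference concrete, here is a small sketch (the function name is mine, purely illustrative): slicing UTF-8 at code-point indices requires a linear scan to find where each code point begins, whereas a constant-width encoding reduces the same slice to plain index arithmetic.

```python
def slice_utf8_codepoints(data: bytes, start: int, end: int) -> bytes:
    """Slice a UTF-8 byte string at code-point indices [start, end).

    Requires a linear scan: every byte must be inspected to find
    which bytes begin a code point (continuation bytes match 10xxxxxx).
    """
    offsets = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    offsets.append(len(data))
    return data[offsets[start]:offsets[end]]

s = "héllo wörld".encode("utf-8")
assert slice_utf8_codepoints(s, 1, 4) == "éll".encode("utf-8")
```

With a fixed-width encoding of width `w`, the same slice would simply be `data[start*w : end*w]`, O(1) instead of O(n).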
I left combining characters and the other intrinsic language
complexities baked into Unicode out of my previous analysis, but
if you want to bring those in, that's actually an argument in
favor of my encoding. With my encoding, you know up front whether
you're dealing with languages that have such complexity (just
check the header), whereas with a chunk of random UTF-8 text you
can never know that unless you decode the entire string and
extract knowledge of all the languages embedded in it.
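To illustrate the UTF-8 side of that claim (the helper below is my own sketch, not any standard API): discovering which scripts a UTF-8 string contains requires decoding every code point, since nothing in the byte stream announces it up front.

```python
import unicodedata

def scripts_in_utf8(data: bytes) -> set:
    """Learn which scripts a UTF-8 string contains.

    There is no shortcut: the entire string must be decoded,
    one code point at a time, to classify each character.
    """
    scripts = set()
    for ch in data.decode("utf-8"):  # full O(n) decode
        # Crude proxy for script: first word of the character name.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

assert scripts_in_utf8("aé".encode("utf-8")) == {"LATIN"}
assert "CJK" in scripts_in_utf8("a你".encode("utf-8"))
```

Under the proposed header scheme, the equivalent information would be a single header lookup instead of this full decode.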
For another similar example, let's say you want to run toUpper on
a multi-language string, which contains English in the first half
and some Asian script that doesn't define uppercase in the second
half. With my format, toUpper can check the header, then process
the English half and skip the Asian half (I'm assuming that the
substring indices for each language would be stored in this more
complex header). With UTF-8, you have to process the entire
string, because you never know what random languages might be
packed in there.
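A rough sketch of that toUpper scenario, assuming the hypothetical header described above (the `Segment` layout and `HAS_CASE` set are my inventions for illustration, not part of any defined format):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One header entry: a script tag plus the substring it covers."""
    script: str  # e.g. "Latin", "Han"
    text: str

# Scripts that actually define letter case (illustrative subset).
HAS_CASE = {"Latin", "Greek", "Cyrillic"}

def to_upper(segments):
    out = []
    for seg in segments:
        if seg.script in HAS_CASE:
            out.append(seg.text.upper())  # only caseful scripts are processed
        else:
            out.append(seg.text)          # skipped wholesale via the header
    return "".join(out)

mixed = [Segment("Latin", "hello "), Segment("Han", "你好")]
assert to_upper(mixed) == "HELLO 你好"
```

The point of the sketch: the Han segment is never examined character by character, whereas a UTF-8 toUpper must walk every code point of the whole string.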
UTF-8 is riddled with such performance bottlenecks, all to make
it self-synchronizing. But is anybody really using its less
compact encoding to do some "self-synchronized" integrity
checking? I suspect almost nobody is.
> If you want to split a string by ASCII whitespace (newlines,
> tabs and spaces), it makes no difference whether the string is
> in ASCII or UTF-8 - the code will behave correctly in either
> case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode
while splitting, when compared to a single-byte encoding.
>> You cannot honestly look at those multiple state diagrams and
>> tell me it's "simple."
> I meant that it's simple to implement (and adapt/port to other
> languages). I would say that UTF-8 is quite cleverly designed,
> so I wouldn't say it's simple by itself.
Perhaps decoding is not so bad for the type of people who write
the fundamental UTF-8 libraries. But implementation does not
refer merely to the UTF-8 libraries themselves, but also to all
the code that tries to build on them for internationalized apps.
And with all the unnecessary additional complexity added by
UTF-8, wrapping the average programmer's head around this mess
likely leads to as many problems as broken code-page
implementations did back in the day. ;)