On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev
wrote:
Another thing I noticed: sometimes when you think you really
need to operate on individual characters (and that your code
will not be correct unless you do that), the assumption will be
incorrect due to the existence of combining characters in
Unicode. Two of the often-quoted use cases for working on
individual code points are calculating the string width
(assuming a fixed-width font) and slicing the string - both of
these will break with combining characters if those are not
accounted for. I believe the proper way to approach such tasks
is to implement the respective Unicode algorithms, which I
believe are non-trivial enough that the relative overhead of
working with a variable-width encoding is acceptable.
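To make that concrete, here's a minimal sketch (in Python rather than D, just for brevity) of how a combining character breaks naive per-code-point slicing - the accented letter is two code points but one user-perceived character:

```python
# "é" written as base letter + combining accent: two code points,
# one grapheme. Code-point counts and slices don't match what the
# user sees on screen.
s = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(s))   # 2 -- code-point count, not display width
print(s[:1])    # "e" -- naive slicing strips the accent
```

So even a constant-width encoding at the code-point level doesn't make slicing "just work"; you still need the grapheme-segmentation algorithm from the Unicode standard.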
Combining characters are examples of complexity baked into the
various languages, so there's no way around that. I'm arguing
against layering more complexity on top, through UTF-8.
Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?
Let's take one you listed above, slicing a string. You have to
either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode
every single UTF-8 character along the way, every single time. A
constant-width, single-byte encoding would be much easier to
slice, while still using at most half the space.
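To illustrate the cost being argued about, here's a sketch (Python, operating on raw bytes; the helper name is made up for this example) of slicing a UTF-8 string by code-point index - note that it has to scan every byte from the start, whereas a constant-width encoding could jump straight to the offset:

```python
def utf8_slice(data: bytes, start: int, end: int) -> bytes:
    """Slice a UTF-8 byte string by code-point indices [start, end).
    Requires a linear scan: lead bytes (anything that is not a
    0b10xxxxxx continuation byte) must be counted one by one."""
    offsets = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    offsets.append(len(data))
    return data[offsets[start]:offsets[end]]

s = "naïve café".encode("utf-8")
print(utf8_slice(s, 6, 10).decode("utf-8"))  # prints "café"
```

With a fixed-width encoding the slice is plain pointer arithmetic, `data[start:end]`; that O(1)-vs-O(n) gap is the trade-off under discussion.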
Also, I don't think this has been posted in this thread. Not
sure if it answers your points, though:
http://www.utf8everywhere.org/
That seems to be a call to use UTF-8 on Windows, with plenty of
advice on how best to do so but little justification for why
you'd want to in the first place. For example,
"Q: But what about performance of text processing algorithms,
byte alignment, etc?
A: Is it really better with UTF-16? Maybe so."
Not exactly a considered analysis of the two. ;)
And here's a simple and correct UTF-8 decoder:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and
tell me it's "simple." That said, the difficulty of _using_
UTF-8 is a much bigger problem than implementing a decoder in a
library.
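For comparison, here's a straightforward branching decoder sketch (Python again, not Hoehrmann's DFA - just the naive approach, with the overlong/surrogate checks spelled out explicitly rather than folded into a state table):

```python
def decode_utf8(data: bytes):
    """Minimal UTF-8 decoder sketch: yields code points, raising
    ValueError on malformed input. A DFA decoder like Hoehrmann's
    folds all of these range checks into one table lookup per byte."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif 0xC2 <= b <= 0xDF:          # 0xC0/0xC1 would be overlong
            cp, n = b & 0x1F, 2
        elif 0xE0 <= b <= 0xEF:
            cp, n = b & 0x0F, 3
        elif 0xF0 <= b <= 0xF4:
            cp, n = b & 0x07, 4
        else:
            raise ValueError("invalid lead byte")
        if i + n > len(data):
            raise ValueError("truncated sequence")
        for j in range(1, n):
            c = data[i + j]
            if c & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point")
        if (n == 2 and cp < 0x80) or (n == 3 and cp < 0x800) \
                or (n == 4 and cp < 0x10000):
            raise ValueError("overlong encoding")
        yield cp
        i += n
```

The decoder itself fits in a page either way; the argument above is that the per-character overhead and the ease of getting indexing/slicing subtly wrong are where the real cost of UTF-8 lies.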