On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev wrote:
Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points are calculating the string width (assuming a fixed-width font) and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for them, which I believe are non-trivial and for which the relative overhead of working with a variable-width encoding is acceptable.
Combining characters are examples of complexity baked into the various languages, so there's no way around that. I'm arguing against layering more complexity on top, through UTF-8.
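
To make the combining-character point concrete, here is a minimal Python sketch. Python is used purely for illustration, and the helper names `count_code_points` and `count_non_combining` are hypothetical. It shows that even with one element per code point, counting or slicing by code point can separate a base letter from its accent.

```python
import unicodedata

# "é" written two ways: base letter + U+0301 COMBINING ACUTE ACCENT,
# and the single precomposed code point U+00E9.
decomposed = "e\u0301"
precomposed = "\u00e9"

def count_code_points(s: str) -> int:
    # Naive "width": one column per code point.
    return len(s)

def count_non_combining(s: str) -> int:
    # Crude approximation only: skip code points whose canonical
    # combining class is nonzero. This is NOT full grapheme segmentation.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# Slicing by code point drops the accent and leaves a bare "e".
sliced = decomposed[:1]

print(count_code_points(decomposed), count_non_combining(decomposed), repr(sliced))
```

Both strings render as one character, yet the naive count disagrees between them. The second counter is only a rough stand-in for the grapheme cluster rules in Unicode Standard Annex #29, which cover more cases than combining marks.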

Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8?
Let's take one you listed above, slicing a string. You have to either translate your entire string into UTF-32 so it's constant-width, which is apparently what Phobos does, or decode every single UTF-8 character along the way, every single time. A constant-width, single-byte encoding would be much easier to slice, while still using at most half the space.
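
A rough Python sketch of the slicing difference, under stated assumptions: Latin-1 stands in for a generic single-byte encoding (it is not the scheme proposed in this thread), and `utf8_byte_offset` and `utf8_slice` are hypothetical helpers, not Phobos functions. The input is assumed to be valid UTF-8.

```python
def utf8_byte_offset(data: bytes, char_index: int) -> int:
    """Byte offset of the char_index-th code point in valid UTF-8 data."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(data)  # index just past the last code point
    raise IndexError("code point index out of range")

def utf8_slice(data: bytes, start: int, end: int) -> bytes:
    # Linear scan from the beginning for each bound: O(n) per slice.
    return data[utf8_byte_offset(data, start):utf8_byte_offset(data, end)]

text = "naïve café"
utf8 = text.encode("utf-8")      # variable width: 12 bytes
latin1 = text.encode("latin-1")  # one byte per character: 10 bytes

# Code points 2..5 ("ïve") in each encoding.
utf8_part = utf8_slice(utf8, 2, 5)
latin1_part = latin1[2:5]        # direct indexing, no scan needed

print(utf8_part.decode("utf-8"), latin1_part.decode("latin-1"))
```

The single-byte slice is plain indexing, while the UTF-8 slice has to scan from the start to find each boundary. The sketch says nothing about text that does not fit in one 256-character table, which is where a single-byte scheme needs extra machinery.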

Also, I don't think this has been posted in this thread. Not sure if it answers your points, though:

http://www.utf8everywhere.org/
That seems to be a call to use UTF-8 on Windows, with a lot of info on how best to do so but little justification for why you'd want to in the first place. For example,

"Q: But what about performance of text processing algorithms, byte alignment, etc?

A: Is it really better with UTF-16? Maybe so."

Not exactly a considered analysis of the two. ;)

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and tell me it's "simple." That said, the difficulty of _using_ UTF-8 is a much bigger problem than implementing a decoder in a library.
