On Saturday, 25 May 2013 at 08:58:57 UTC, Vladimir Panteleev
wrote:
Another thing I noticed: sometimes when you think you really
need to operate on individual characters (and that your code
will not be correct unless you do that), the assumption will be
incorrect due to the existence of combining characters in
Unicode. Two of the often-quoted use cases for working on
individual code points are calculating the string width
(assuming a fixed-width font) and slicing the string - both of
these will break with combining characters if those are not
accounted for. I believe the proper way to approach such tasks
is to implement the respective Unicode algorithms, which I
believe are non-trivial enough that the relative overhead of
working with a variable-width encoding is acceptable.
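To make that concrete, here's a minimal sketch (in Python rather than D, just for brevity) of how a combining character breaks naive per-code-point slicing - the accented letter is two code points but one user-perceived character:

```python
# "é" written as base letter + combining accent: two code points,
# one grapheme. Code-point counts and slices don't match what the
# user sees on screen.
s = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(s))   # 2 -- code-point count, not display width
print(s[:1])    # "e" -- naive slicing strips the accent
```

So even a constant-width encoding at the code-point level doesn't make slicing "just work"; you still need the grapheme-segmentation algorithm from the Unicode standard.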
Combining characters are examples of complexity baked into the
various languages, so there's no way around that. I'm arguing
against layering more complexity on top, through UTF-8.
Can you post some specific cases where the benefits of a
constant-width encoding are obvious and, in your opinion, make
constant-width encodings more useful than all the benefits of
UTF-8?
Let's take one you listed above, slicing a string. You have to
either translate your entire string into UTF-32 so it's
constant-width, which is apparently what Phobos does, or decode
every single UTF-8 character along the way, every single time. A
constant-width, single-byte encoding would be much easier to
slice, while still using at most half the space.
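To illustrate the cost being argued about, here's a sketch (Python, operating on raw bytes; the helper name is made up for this example) of slicing a UTF-8 string by code-point index - note that it has to scan every byte from the start, whereas a constant-width encoding could jump straight to the offset:

```python
def utf8_slice(data: bytes, start: int, end: int) -> bytes:
    """Slice a UTF-8 byte string by code-point indices [start, end).
    Requires a linear scan: lead bytes (anything that is not a
    0b10xxxxxx continuation byte) must be counted one by one."""
    offsets = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    offsets.append(len(data))
    return data[offsets[start]:offsets[end]]

s = "naïve café".encode("utf-8")
print(utf8_slice(s, 6, 10).decode("utf-8"))  # prints "café"
```

With a fixed-width encoding the slice is plain pointer arithmetic, `data[start:end]`; that O(1)-vs-O(n) gap is the trade-off under discussion.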
Also, I don't think this has been posted in this thread. Not
sure if it answers your points, though:
http://www.utf8everywhere.org/
That seems to be a call to use UTF-8 on Windows, with plenty of
advice on how best to do so but little justification for why
you'd want to in the first place. For example,
"Q: But what about performance of text processing algorithms,
byte alignment, etc?
A: Is it really better with UTF-16? Maybe so."
Not exactly a considered analysis of the two. ;)
And here's a simple and correct UTF-8 decoder:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
You cannot honestly look at those multiple state diagrams and
tell me it's "simple." That said, the difficulty of _using_
UTF-8 is a much bigger problem than implementing a decoder in a
library.
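For comparison, here's a straightforward branching decoder sketch (Python again, not Hoehrmann's DFA - just the naive approach, with the overlong/surrogate checks spelled out explicitly rather than folded into a state table):

```python
def decode_utf8(data: bytes):
    """Minimal UTF-8 decoder sketch: yields code points, raising
    ValueError on malformed input. A DFA decoder like Hoehrmann's
    folds all of these range checks into one table lookup per byte."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif 0xC2 <= b <= 0xDF:          # 0xC0/0xC1 would be overlong
            cp, n = b & 0x1F, 2
        elif 0xE0 <= b <= 0xEF:
            cp, n = b & 0x0F, 3
        elif 0xF0 <= b <= 0xF4:
            cp, n = b & 0x07, 4
        else:
            raise ValueError("invalid lead byte")
        if i + n > len(data):
            raise ValueError("truncated sequence")
        for j in range(1, n):
            c = data[i + j]
            if c & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point")
        if (n == 2 and cp < 0x80) or (n == 3 and cp < 0x800) \
                or (n == 4 and cp < 0x10000):
            raise ValueError("overlong encoding")
        yield cp
        i += n
```

The decoder itself fits in a page either way; the argument above is that the per-character overhead and the ease of getting indexing/slicing subtly wrong are where the real cost of UTF-8 lies.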