On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:
This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.

For the record, I noticed that programmers (myself included) who have an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points) when in fact it does not, and the code would work just fine if written as though it were handling ASCII. The example Walter quoted (regex, assuming you don't want Unicode ranges or case-insensitivity) is one such case.
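
To make that concrete, here is a rough Python sketch (not taken from any library discussed in this thread) of searching UTF-8 data as plain bytes, without decoding to code points first. The sample string and the helper names `find_literal` and `find_id` are made up for illustration.

```python
import re

# Hypothetical sample: UTF-8 bytes with non-ASCII text around an ASCII pattern.
DATA = "naïve café, id=42, 日本語".encode("utf-8")


def find_literal(haystack: bytes, needle: str) -> int:
    """Return the byte offset of needle in UTF-8 haystack, or -1.

    The haystack is never decoded; only the needle is encoded once.
    """
    return haystack.find(needle.encode("utf-8"))


def find_id(haystack: bytes):
    """Run a bytes regex directly over UTF-8 data.

    Returns the matched digits as bytes, or None if there is no match.
    With a bytes pattern, \\d matches ASCII digits only.
    """
    m = re.search(rb"id=(\d+)", haystack)
    return m.group(1) if m else None
```

A match for a valid UTF-8 needle cannot begin in the middle of a character, because lead bytes and continuation bytes occupy disjoint ranges. So the byte offset returned always falls on a character boundary, and the prefix `DATA[:offset]` can be decoded on its own if a code point index is ever needed.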

Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases for working on individual code points are calculating the string width (assuming a fixed-width font) and slicing the string; both will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms, which I believe are non-trivial and for which the relative overhead of working with a variable-width encoding is acceptable.
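
A small Python sketch of the combining-character problem, assuming only the standard `unicodedata` module. The helpers `naive_width` and `approx_width` are hypothetical names. `approx_width` is a crude stand-in that merely skips combining marks; it is not the full Unicode grapheme segmentation algorithm and ignores wide characters, joiners and so on.

```python
import unicodedata

# "café" spelled with a decomposed accent: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
DECOMPOSED = "cafe\u0301"


def naive_width(s: str) -> int:
    """Count code points, which is what constant-width indexing gives you."""
    return len(s)


def approx_width(s: str) -> int:
    """Count code points whose canonical combining class is 0.

    A rough approximation of displayed columns in a fixed-width font;
    an assumption for illustration, not a complete width algorithm.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)
```

Here `naive_width(DECOMPOSED)` counts five code points for a string that displays as four characters, and the slice `DECOMPOSED[:4]` silently drops the accent. Having one array element per code point does not help with either problem.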

Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it answers your points, though:

http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
