Re: Why UTF-8/16 character encodings?

Vladimir Panteleev Sat, 25 May 2013 02:00:58 -0700

On Saturday, 25 May 2013 at 07:33:15 UTC, Joakim wrote:

This is more a problem with the algorithms taking the easy way than a problem with UTF-8. You can do all the string algorithms, including regex, by working with the UTF-8 directly rather than converting to UTF-32. Then the algorithms work at full speed.
I call BS on this. There's no way working on a variable-width encoding can be as "full speed" as a constant-width encoding. Perhaps you mean that the slowdown is minimal, but I doubt that also.

For the record, I noticed that programmers (myself included) who had an incomplete understanding of Unicode / UTF exaggerate this point, and sometimes needlessly assume that their code needs to operate on individual characters (code points), when in fact it does not - and that code will work just fine as if it were written to handle ASCII. The example Walter quoted (regex - assuming you don't want Unicode ranges or case-insensitivity) is one such case.
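To illustrate the point with something concrete, here is a minimal sketch in Python (not D, and the function name `split_fields` and the sample text are made up for the example). It splits and searches UTF-8 data at the byte level without ever decoding code points. This works because in valid UTF-8 the bytes of one encoded character never appear inside the encoding of another, so a byte-level match always lands on a character boundary.

```python
def split_fields(line, sep=b","):
    """Split raw UTF-8 bytes on an ASCII separator without decoding."""
    # ASCII bytes (0x00-0x7F) never occur inside a multi-byte UTF-8
    # sequence, so a plain byte split cannot cut a character in half.
    return line.split(sep)


text = "na\u00efve,\u041c\u043e\u0441\u043a\u0432\u0430,\u6771\u4eac"
raw = text.encode("utf-8")

# Byte-level splitting, decoded only afterwards to show the result.
fields = [f.decode("utf-8") for f in split_fields(raw)]

# Byte-level substring search with a non-ASCII needle; the result is a
# byte offset, not a code point index.
needle = "\u041c\u043e\u0441\u043a\u0432\u0430".encode("utf-8")
offset = raw.find(needle)
```

The search returns a byte offset, which is all you need if you keep slicing the same byte buffer.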

Another thing I noticed: sometimes when you think you really need to operate on individual characters (and that your code will not be correct unless you do that), the assumption will be incorrect due to the existence of combining characters in Unicode. Two of the often-quoted use cases of working on individual code points are calculating the string width (assuming a fixed-width font), and slicing the string - both of these will break with combining characters if those are not accounted for. I believe the proper way to approach such tasks is to implement the respective Unicode algorithms for them, which I believe are non-trivial and for which the relative overhead of working with a variable-width encoding is acceptable.
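A small Python sketch of how combining characters break the "one code point = one character" assumption. The helper `rough_display_width` is hypothetical and only a rough approximation: it skips code points with a non-zero combining class, which is not the full Unicode grapheme cluster algorithm and ignores wide East Asian characters entirely.

```python
import unicodedata


def rough_display_width(s):
    """Count code points that are not combining marks (approximation only)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: renders as four
# characters, but consists of five code points.
decomposed = "cafe\u0301"

code_points = len(decomposed)            # counts the accent separately
width = rough_display_width(decomposed)  # skips the combining accent

# Slicing by code point index silently drops the accent.
sliced = decomposed[:4]

# NFC normalization composes this particular pair into one code point,
# but not every combining sequence has a precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
```

So even with a constant-width encoding like UTF-32, the width calculation and the slice both need extra logic to get this right.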

Can you post some specific cases where the benefits of a constant-width encoding are obvious and, in your opinion, make constant-width encodings more useful than all the benefits of UTF-8?

Also, I don't think this has been posted in this thread. Not sure if it answers your points, though:


http://www.utf8everywhere.org/

And here's a simple and correct UTF-8 decoder:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
