On Fri, Jul 20, 2012 at 1:31 PM, Mark Davis ☕ <[email protected]> wrote:
> I put together some notes on different ways for programming languages to
> handle Unicode at a low level. Comments welcome.
> Macchiato »
> Many programming languages (and most modern software) have moved to Unicode
> model of text. Text coming into the system might be in legacy encodings like
> Shift-JIS or Latin-1, and text being pushed out...

I had a few comments for general discussion:

"That means that it is best to optimize for BMP characters (and as a subset, ASCII and Latin-1), and fall into a ‘slow path’ when a supplementary character is encountered."

I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of developers to test speed with the English/European documents they have around and test correctness only with Chinese. I see the argument in theory and practice, but it's a tough line to walk, especially if you're not familiar with i18n. For example, I can see something like

    for i in range (1, 1000) do a := " "; a +:= "龜"; done

being way slower than necessary (especially in cases that aren't trivially optimized away).

"Interfacing with most software libraries can avoid conversions in and out"

I'm curious about this. I won't dismiss it offhand, but besides ICU, what libraries are we talking about that haven't already been rewritten for GTK, Java, Python, take your pick?

"The string class is indexed by code unit, and is UTF-32. Used by: glibc?"

I haven't poked at it, but Ada 2012 (in the pre-standard, editorial-changes-only stage) has Latin-1, UCS-2 (the standard is not clear here about UTF-16 vs. UCS-2) and UTF-32 (UCS-4; it mentions 2147483648 code points) strings. There are functions in the standard to store a Unicode string in the Latin-1 strings as UTF-8 and in the UCS-2 strings as UTF-16, but there is a choice to use straight UTF-32.

"The question of whether to allow non-ASCII characters in variables is open."

I don't see why. Yes, a lot of organizations will use ASCII only, but not all programming is done in large international organizations. For personal hacking, or small mononational organizations, Unicode variables may be much more convenient.
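As a rough illustration of what non-ASCII variables look like in practice, here is a minimal sketch. It assumes Python 3, which accepts non-ASCII identifiers; the names are hypothetical and chosen only for the example, with English glosses in the comments.

```python
# Assumes Python 3, which accepts non-ASCII identifiers.
# All names below are hypothetical, chosen only for illustration.

单价 = 120   # unit price
数量 = 3     # quantity


def 计算总价(单价, 数量):
    """Return the total price for the given unit price and quantity."""
    return 单价 * 数量


总价 = 计算总价(单价, 数量)   # total price
print(总价)
```

Whether this reads better than romanized names depends on who maintains the code, which is really the point under discussion.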
It's not like Chinese variables with Chinese comments are going to be much harder for an English speaker to debug than English variables (or bad English variables) with Chinese comments, and ASCII-romanized Chinese variables may be the worst of all worlds.

-- 
Kie ekzistas vivo, ekzistas espero.

